Closed hsun3163 closed 2 years ago
@floutt to bring you up to date: based on your MWE we are able to get it to the point before QC. However we noticed some output postprocessing issues and we need to test it with multiple samples. Do you think you could make a MWE for another sample so we can test with 2 samples? Or maybe 3? 3 definitely is enough.
When you are done please upload that to the same Google Drive folder. Thank you!
@hsun3163 I've reviewed and finalized the multi-sample workflow without merging the results. Please reproduce them here:
https://cumc.github.io/xqtl-pipeline/pipeline/molecular_phenotypes/bulk_expression.html
Could you finish it up by merging the output to produce the sample matrix, so that:
Thanks!
you don't lose information in the process the output is compatible with the TPM count QC workflow input The output of rnaseqQC is
#1.12 number of genes Name | Description | sample_ID (actual data) ENSG... AAA 0.0
In order for the output to be compatible with TPM countQC workflow, it has to take the format of
gene_ID | Sample1 | Sample2 | Sample3
ENSG... | 0.0 | 0.02 | 1.1
Which do not have rooms for the Description columns.
As the description columns is just the name of the gene, which were not used in the downstream analysis and can be easily obtained later on I wonder is it necessary to keep it?
The same goes with the number of genes
information, on one hand this info can be easily obtained by wc *.gct
, on the other hand some genes may also be filter out in the subsequent follow-up step though.
So I wonder is it necessary to keep the Description
column and the number of gene
note?
@hsun3163 we can comment out #number of genes
? Or simply remove it I'm find with that. For Description
shall we keep it and change the TPM countQC workflow to ignore it? It's nice to have the wider used gene name readily accessible in the output.
@hsun3163 we can comment out
#number of genes
? Or simply remove it I'm find with that. ForDescription
shall we keep it and change the TPM countQC workflow to ignore it? It's nice to have the wider used gene name readily accessible in the output.
It is easy to changes qc pipeline, but the gene name cannot pass the normalization step though. Also, I remember both APEX and tensorQTL are quite picky about their input.
Also our final output is vcf per gene. So, if we do want to get the gene name readied in the output. We should add a step to tag it in the vcf.
@hsun3163 okay let's drop it for now. We might want to reannotate it after we are done -- i agree it should be easy once in VCF format
At the moment the bulk_expression module use the output of run_RSEM.py as the final product. However, the output of RSEM, as followed:
is not what is needed by the downstream analysis, (i.e a geneCounts/TPM table).
Therefore, once the current fixing is done, two things needs to be added to the module:
Additional thought: If we gonna run the module for {number of samples} times. Maybe we should ask the user to get the adaptor sequence from the manufacturer instead of asking them to figure it out from the fastqc report. We can keep the fastqc steps though just as additional info.