cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium
https://cumc.github.io/xqtl-protocol/
MIT License
38 stars 42 forks source link

Merge multiple sample RNA-seq calling results #94

Closed hsun3163 closed 2 years ago

hsun3163 commented 2 years ago

At the moment the bulk_expression module use the output of run_RSEM.py as the final product. However, the output of RSEM, as followed:

sample_name.genes.results
File containing gene level expression estimates. The first line contains column names separated by the tab character. The format of each line in the rest of this file is:

gene_id transcript_id(s) length effective_length expected_count TPM FPKM [posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound TPM_coefficient_of_quartile_variation FPKM_ci_lower_bound FPKM_ci_upper_bound FPKM_coefficient_of_quartile_variation]

Fields are separated by the tab character. Fields within "[]" are optional. They will not be presented if neither '--calc-pme' nor '--calc-ci' is set.

'transcript_id(s)' is a comma-separated list of transcript_ids belonging to this gene. If no gene information is provided, 'gene_id' and 'transcript_id(s)' are identical (the 'transcript_id').

A gene's 'length' and 'effective_length' are defined as the weighted average of its transcripts' lengths and effective lengths (weighted by 'IsoPct'). A gene's abundance estimates are just the sum of its transcripts' abundance estimates.

is not what is needed by the downstream analysis, (i.e a geneCounts/TPM table).

Therefore, once the current fixing is done, two things needs to be added to the module:

  1. Parallelization
  2. Merging the output of this module to produce the whole sample matrix.

Additional thought: If we gonna run the module for {number of samples} times. Maybe we should ask the user to get the adaptor sequence from the manufacturer instead of asking them to figure it out from the fastqc report. We can keep the fastqc steps though just as additional info.

gaow commented 2 years ago

@floutt to bring you up to date: based on your MWE we are able to get it to the point before QC. However we noticed some output postprocessing issues and we need to test it with multiple samples. Do you think you could make a MWE for another sample so we can test with 2 samples? Or maybe 3? 3 definitely is enough.

When you are done please upload that to the same Google Drive folder. Thank you!

gaow commented 2 years ago

@hsun3163 I've reviewed and finalized the multi-sample workflow without merging the results. Please reproduce them here:

https://cumc.github.io/xqtl-pipeline/pipeline/molecular_phenotypes/bulk_expression.html

Could you finish it up by merging the output to produce the sample matrix, so that:

  1. you don't lose information in the process
  2. the output is compatible with the TPM count QC workflow input

Thanks!

hsun3163 commented 2 years ago

you don't lose information in the process the output is compatible with the TPM count QC workflow input The output of rnaseqQC is

#1.12
number of genes
Name | Description | sample_ID (actual data)
ENSG...      AAA          0.0

In order for the output to be compatible with TPM countQC workflow, it has to take the format of

gene_ID | Sample1 | Sample2 | Sample3
ENSG...    |  0.0   | 0.02   | 1.1

Which do not have rooms for the Description columns.

As the description columns is just the name of the gene, which were not used in the downstream analysis and can be easily obtained later on I wonder is it necessary to keep it?

The same goes with the number of genes information, on one hand this info can be easily obtained by wc *.gct, on the other hand some genes may also be filter out in the subsequent follow-up step though.

So I wonder is it necessary to keep the Description column and the number of gene note?

gaow commented 2 years ago

@hsun3163 we can comment out #number of genes? Or simply remove it I'm find with that. For Description shall we keep it and change the TPM countQC workflow to ignore it? It's nice to have the wider used gene name readily accessible in the output.

hsun3163 commented 2 years ago

@hsun3163 we can comment out #number of genes? Or simply remove it I'm find with that. For Description shall we keep it and change the TPM countQC workflow to ignore it? It's nice to have the wider used gene name readily accessible in the output.

It is easy to changes qc pipeline, but the gene name cannot pass the normalization step though. Also, I remember both APEX and tensorQTL are quite picky about their input.

Also our final output is vcf per gene. So, if we do want to get the gene name readied in the output. We should add a step to tag it in the vcf.

gaow commented 2 years ago

@hsun3163 okay let's drop it for now. We might want to reannotate it after we are done -- i agree it should be easy once in VCF format