Closed hsun3163 closed 2 years ago
Hi @gaow, as it turns out, the PCs are estimated from a matrix of normalized proportions (written to the per-chromosome qqnorm_*.gz output files), so we should compute an independent set of factors from it. However, I want to confirm that, as with the expression data, we use the residual of that matrix, because we want to exclude the PCs from the genomic background and the known covariates, right?
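For illustration, a minimal sketch of what "use the residual, then compute factors" could look like. It assumes a features x samples matrix of normalized proportions and a samples x k covariate matrix; the function names are hypothetical, and plain PCA via SVD stands in for whatever factor analysis the pipeline actually runs.

```python
import numpy as np

def residualize(pheno, covariates):
    """Regress covariates (samples x k) out of pheno (features x samples)."""
    # Design matrix with an intercept column, shape samples x (k + 1).
    design = np.column_stack([np.ones(covariates.shape[0]), covariates])
    # Fit all features at once: design @ beta approximates pheno.T.
    beta, *_ = np.linalg.lstsq(design, pheno.T, rcond=None)
    # Residuals, returned in the original features x samples orientation.
    return (pheno.T - design @ beta).T

def top_pcs(residual, n_pcs):
    """Return a samples x n_pcs matrix of principal component scores."""
    # Center each feature across samples before the decomposition.
    centered = residual - residual.mean(axis=1, keepdims=True)
    # SVD of the samples x features matrix; scores are U scaled by S.
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]
```

Because the factors are computed from the residual, they are orthogonal to the covariates that were regressed out, which is the point of residualizing first.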
@hsun3163 yes to your question.
We need to QC the splicing calls before doing that. We have decided to use the same QC as for expression, but we need to raise the minimum reads filter (to at least 30 million; the current default is 10 million) because we need enough reads to call splicing events reliably. This can be done with the eQTL phenotype QC pipeline when we analyze the eQTL and sQTL data together. For now, let's assume the data is QC-ed so you can write the sQTL calling pipeline. Just don't write the command generator for sQTL calling yet.
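A rough sketch of the read-depth filter described above. Only the two thresholds come from this thread; the input format (a pandas Series of total reads indexed by sample ID) and the function name are assumptions for illustration, not the actual QC pipeline.

```python
import pandas as pd

MIN_READS_EQTL = 10_000_000  # current default mentioned in the thread
MIN_READS_SQTL = 30_000_000  # stricter threshold proposed for splicing calls

def samples_passing_depth(read_counts, min_reads=MIN_READS_SQTL):
    """Return the IDs of samples with at least `min_reads` total reads.

    read_counts: pandas Series indexed by sample ID (hypothetical input).
    """
    return list(read_counts[read_counts >= min_reads].index)
```

Running it once with each threshold shows how many samples the stricter sQTL filter would drop relative to the eQTL default.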
As I see it, though, most of this pipeline should be the same as the eQTL pipeline, besides the first step that fixes whatever formatting issues exist, the residual and factor analysis, etc.
Alternatively, if we decide to change "Phenotype file end is not start+1, apex/tensorQTL may not like it." in the tensorQTL module directly, all the modules should be the same.
With the current understanding on what to do with the splicing data, and that the output of leafcutter is per chrom, I think the entry point should be a module that
end column,
Alternatively, and I am leaning toward this approach, for the last step of the leafcutter module, we can
So that the entry point of sQTL calling pipeline only needs to
#Chr start end ID 10518782Aligned.sortedByCoord.out 11395417Aligned.sortedByCoord.out 20214850Aligned.sortedByCoord.out
end column,
splicing post-processing done in the gene annotation notebook for both leafcutter and psichomics
GTEx has already implemented a way to conduct sQTL analysis via tensorQTL and leafcutter, so we can pipe what we currently have into some of their code.
GTEx conducts this analysis via a cluster_prepare_fastqtl.py wrapper, which takes care of the calling, QC, and formatting steps. What we should do is
[x] Extract the QC and modify formatting code from cluster_prepare_fastqtl.py
[x] Figure out how to generate the ${prefix}.leafcutter.phenotype_groups.txt file
[x] Change our tensorQTL cis module to accommodate the phenotype_group option
[x] Change our module to accommodate a different output format from the phenotype_group
[x] gene_ID is not an accurate description of the features, which will motivate us to use molecular_trait_id instead of gene_id throughout our modules.
ARCHIVE: No longer appropriate.
With the Leafcutter module finished, it is time to ensure that its output can be taken by TensorQTL/APEX directly. Potential issues:
As demonstrated below:
Both of the aforementioned problems can be easily fixed by reading the phenotype file with a function other than tensorqtl.read_phenotype_bed, which only does some formatting of the bed.gz file.
However, a problem arising from problem 2 is how to define the cis-window size and center for a leafcutter feature.
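As a sketch of the "read it with another function" idea, the snippet below parses a BED-style phenotype table with pandas and makes the choice of cis-window center explicit. It assumes the header shown earlier in this thread (#Chr, start, end, ID, then one column per sample); the function name, the `center` parameter, and the output column names are hypothetical, and which center is appropriate for a leafcutter intron is exactly the open question.

```python
import pandas as pd

def read_splicing_bed(path_or_buffer, center="midpoint"):
    """Read a BED-style phenotype table whose end is not start + 1.

    Returns (phenotype_df, pos_df): the per-sample values indexed by
    feature ID, and a table with one chosen position per feature.
    """
    # pandas infers gzip compression from a .gz file extension.
    bed = pd.read_csv(path_or_buffer, sep="\t", dtype={"#Chr": str})
    bed = bed.set_index("ID")

    # The center of the cis-window is a modelling choice, not a given.
    if center == "start":
        pos = bed["start"]
    elif center == "end":
        pos = bed["end"]
    elif center == "midpoint":
        pos = (bed["start"] + bed["end"]) // 2
    else:
        raise ValueError(f"unknown center: {center!r}")

    pos_df = pd.DataFrame({"chr": bed["#Chr"], "pos": pos})
    phenotype_df = bed.drop(columns=["#Chr", "start", "end"])
    return phenotype_df, pos_df
```

Keeping the center as a parameter means the window definition can be changed later without touching the parsing code.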
Besides, some other questions remain:
Does the phenotype still need to be factor analyzed? (No, we use the RNA-seq factors.)
PCs may need different merging code, depending on the format. (No leafcutter PCs.)