cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium
https://cumc.github.io/xqtl-protocol/
MIT License
38 stars 42 forks source link

Generalization to accommodate leafcutter #283

Closed hsun3163 closed 2 years ago

hsun3163 commented 2 years ago

As gtex have already implement a way to conduct sQTL analysis via tensorQTL and leafcutter.

We can piped what we currently have into some of their code.

gtex conduct such analysis via a cluster_prepare_fastqtl.py wrapper, which takes care of both the calling, QC, and formatting step. What we should do is

ARCHIVE: No longer appropriate.


With the Leafcutter module being finished. It is time to ensure its output can be taken by TensorQTL/APEX directly. Potential Issue:

  1. TensorQTL require phenotype file to be named bed.gz
  2. Phenotype file end is not start+1, apex/tensorQTL may not like it.

As demonstrated below:

>>> phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hs3163/miniconda3/lib/python3.9/site-packages/tensorqtl/core.py", line 297, in read_phenotype_bed
    raise ValueError('Unsupported file type.')
ValueError: Unsupported file type.
>>> expression_bed = '/mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/MWE/QTL_association/sample_list_intron_usage_perind.counts.gz.qqnorm_14.bed.gz'
>>> phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hs3163/miniconda3/lib/python3.9/site-packages/tensorqtl/core.py", line 301, in read_phenotype_bed
    raise ValueError("The BED file must define the TSS/cis-window center, with start+1 == end.")
ValueError: The BED file must define the TSS/cis-window center, with start+1 == end.

Both of the aforementioned problems can be easily fixed by reading the phenotype file using a function other than tensorqtl.read_phenotype_bed, which is just some formatting of bed.gz

def read_phenotype_bed(phenotype_bed):
    """Load phenotype BED file as phenotype and TSS DataFrames"""
    if phenotype_bed.endswith('.bed.gz'):
        phenotype_df = pd.read_csv(phenotype_bed, sep='\t', index_col=3, dtype={'#chr':str, '#Chr':str})
    elif phenotype_bed.endswith('.parquet'):
        phenotype_df = pd.read_parquet(phenotype_bed)
        phenotype_df.set_index(phenotype_df.columns[3], inplace=True)
    else:
        raise ValueError('Unsupported file type.')
    phenotype_df.rename(columns={i:i.lower().replace('#chr','chr') for i in phenotype_df.columns[:3]}, inplace=True)
    # make sure TSS/cis-window is properly defined
    if not (phenotype_df['start']+1 == phenotype_df['end']).all():
        raise ValueError("The BED file must define the TSS/cis-window center, with start+1 == end.")
    phenotype_pos_df = phenotype_df[['chr', 'end']].rename(columns={'end':'tss'})
    phenotype_df.drop(['chr', 'start', 'end'], axis=1, inplace=True)
    return phenotype_df, phenotype_pos_df

However, a problem spurred from problem 2 is how do we define the cis-window size and center for the leafcutter feature.

Besides, some other questions remain:

does phenotype still needs to be factor analyzed (No, we use the rnaseq factor) PCs may need different merging codes, depending on the format (No leafcutter pcs)

hsun3163 commented 2 years ago

Hi @gaow, as it turns out, the PC is estimated from a matrix of normalized proportion (written to the per chrom output qqnorm_*.gz file ), so we should compute an independent set of factors based on it. However, I want to make sure that, similar to the expression data, we are using the residual of that matrix because we want to exclude the PCs from the genomics background and the known covariates right?

gaow commented 2 years ago

@hsun3163 yes to your question.

We need to QC on the splicing calls before doing that. We have decided to use QC same as with expression but we need to increase the minimum reads filter (at least 30 million; current default is 10 million) because we need enough reads to call splicing events reliably. This can be done when we analyze both eQTL and sQTL data we can use the eQTL phenotype QC pipeline. For now, let's assume it's QC-ed you can write the sQTL calling pipeline -- just dont write the command generator yet for sQTL calling

hsun3163 commented 2 years ago

ARCHIVE: See the main post


@hsun3163 yes to your question.

We need to QC on the splicing calls before doing that. We have decided to use QC same as with expression but we need to increase the minimum reads filter (at least 30 million; current default is 10 million) because we need enough reads to call splicing events reliably. This can be done when we analyze both eQTL and sQTL data we can use the eQTL phenotype QC pipeline. For now, let's assume it's QC-ed you can write the sQTL calling pipeline -- just dont write the command generator yet for sQTL calling

As I see it, the majority of this pipeline should be the same as that of the eQTL pipeline, though, besides the first step that fixes whatever formatting issue, the residual and factor analysis, .etc.

Alternatively, if we decide to change

Phenotype file end is not start+1, apex/tensorQTL may not like it.

in the tensorQTL module directly, all the modules should be the same.


With the current understanding on what to do with the splicing data, and that the output of leafcutter is per chrom, I think the entry point should be a module that


Alternatively, and I am leaning toward this approach, for the last step of the leafcutter module, we can

So that the entry point of sQTL calling pipeline only needs to

hsun3163 commented 2 years ago

splicing post-processing done in the gene annotation notebook for both leafcutter and psichomics