Generalization to accommodate leafcutter

hsun3163 commented 2 years ago

As gtex have already implement a way to conduct sQTL analysis via tensorQTL and leafcutter.

We can piped what we currently have into some of their code.

gtex conduct such analysis via a cluster_prepare_fastqtl.py wrapper, which takes care of both the calling, QC, and formatting step. What we should do is

[x] Extract the QC and modify formatting code from cluster_prepare_fastqtl.py
[x] Figure out how to generate the ${prefix}.leafcutter.phenotype_groups.txt file
[x] Change our tensorQTL cis module to accommodate the phenotype_group option
[x] Change our module to accommodate a different output format from the phenotype_group
[x] gene_ID is not an accurate description of the features, which will motivate us to use molecular_trait_id instead of gene_id throughout our modules. ()

ARCHIVE: No longer appropriate.

With the Leafcutter module being finished. It is time to ensure its output can be taken by TensorQTL/APEX directly. Potential Issue:

TensorQTL require phenotype file to be named bed.gz
Phenotype file end is not start+1, apex/tensorQTL may not like it.

As demonstrated below:

>>> phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hs3163/miniconda3/lib/python3.9/site-packages/tensorqtl/core.py", line 297, in read_phenotype_bed
    raise ValueError('Unsupported file type.')
ValueError: Unsupported file type.
>>> expression_bed = '/mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/MWE/QTL_association/sample_list_intron_usage_perind.counts.gz.qqnorm_14.bed.gz'
>>> phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hs3163/miniconda3/lib/python3.9/site-packages/tensorqtl/core.py", line 301, in read_phenotype_bed
    raise ValueError("The BED file must define the TSS/cis-window center, with start+1 == end.")
ValueError: The BED file must define the TSS/cis-window center, with start+1 == end.

Both of the aforementioned problems can be easily fixed by reading the phenotype file using a function other than tensorqtl.read_phenotype_bed, which is just some formatting of bed.gz

def read_phenotype_bed(phenotype_bed):
    """Load phenotype BED file as phenotype and TSS DataFrames"""
    if phenotype_bed.endswith('.bed.gz'):
        phenotype_df = pd.read_csv(phenotype_bed, sep='\t', index_col=3, dtype={'#chr':str, '#Chr':str})
    elif phenotype_bed.endswith('.parquet'):
        phenotype_df = pd.read_parquet(phenotype_bed)
        phenotype_df.set_index(phenotype_df.columns[3], inplace=True)
    else:
        raise ValueError('Unsupported file type.')
    phenotype_df.rename(columns={i:i.lower().replace('#chr','chr') for i in phenotype_df.columns[:3]}, inplace=True)
    # make sure TSS/cis-window is properly defined
    if not (phenotype_df['start']+1 == phenotype_df['end']).all():
        raise ValueError("The BED file must define the TSS/cis-window center, with start+1 == end.")
    phenotype_pos_df = phenotype_df[['chr', 'end']].rename(columns={'end':'tss'})
    phenotype_df.drop(['chr', 'start', 'end'], axis=1, inplace=True)
    return phenotype_df, phenotype_pos_df

However, a problem spurred from problem 2 is how do we define the cis-window size and center for the leafcutter feature.

Besides, some other questions remain:

does phenotype still needs to be factor analyzed (No, we use the rnaseq factor) PCs may need different merging codes, depending on the format (No leafcutter pcs)

hsun3163 commented 2 years ago

Hi @gaow, as it turns out, the PC is estimated from a matrix of normalized proportion (written to the per chrom output qqnorm_*.gz file ), so we should compute an independent set of factors based on it. However, I want to make sure that, similar to the expression data, we are using the residual of that matrix because we want to exclude the PCs from the genomics background and the known covariates right?

gaow commented 2 years ago

@hsun3163 yes to your question.

We need to QC on the splicing calls before doing that. We have decided to use QC same as with expression but we need to increase the minimum reads filter (at least 30 million; current default is 10 million) because we need enough reads to call splicing events reliably. This can be done when we analyze both eQTL and sQTL data we can use the eQTL phenotype QC pipeline. For now, let's assume it's QC-ed you can write the sQTL calling pipeline -- just dont write the command generator yet for sQTL calling

hsun3163 commented 2 years ago

ARCHIVE: See the main post

@hsun3163 yes to your question.

We need to QC on the splicing calls before doing that. We have decided to use QC same as with expression but we need to increase the minimum reads filter (at least 30 million; current default is 10 million) because we need enough reads to call splicing events reliably. This can be done when we analyze both eQTL and sQTL data we can use the eQTL phenotype QC pipeline. For now, let's assume it's QC-ed you can write the sQTL calling pipeline -- just dont write the command generator yet for sQTL calling

As I see it, the majority of this pipeline should be the same as that of the eQTL pipeline, though, besides the first step that fixes whatever formatting issue, the residual and factor analysis, .etc.

Alternatively, if we decide to change

Phenotype file end is not start+1, apex/tensorQTL may not like it.

in the tensorQTL module directly, all the modules should be the same.

With the current understanding on what to do with the splicing data, and that the output of leafcutter is per chrom, I think the entry point should be a module that

[ ] take a list of phenotype,
[ ] merged them into 1 matrix,
[ ] change the end column,
[ ] also output the region list
[ ] bgzip & tabix

Alternatively, and I am leaning toward this approach, for the last step of the leafcutter module, we can

[ ] cat all the qqnorm file
[ ] any QC

So that the entry point of sQTL calling pipeline only needs to

[ ] Fix the sample names, now it is

#Chr    start   end     ID      10518782Aligned.sortedByCoord.out       11395417Aligned.sortedByCoord.out       20214850Aligned.sortedByCoord.out

[ ] change the end column,
[ ] also output the region list
[ ] bgzip & tabix

hsun3163 commented 2 years ago

splicing post-processing done in the gene annotation notebook for both leafcutter and psichomics

cumc / xqtl-protocol

Generalization to accommodate leafcutter #283

ARCHIVE: No longer appropriate.

ARCHIVE: See the main post