bmvdgeijn / WASP

WASP: allele-specific pipeline for unbiased read mapping and molecular QTL discovery
Apache License 2.0
103 stars 51 forks source link

constructing input files for pairwise eQTL testing #95

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi,

I am interested in identifying significant eQTLs given n phased SNPs and m genes with allele-specific reads.

I studied the get_target_regions script, which doesn't have the option of specifying TSS, so I made my own input files. It contains m x n rows with all possible associations between the test SNPs and genes, and the reference/alternate read counts of the genes depend on the particular row's test SNP.

My question is that fit_as_coefficients seems to assume that the # of rows = # of genes, or in other words, each gene must unique ref/alt read counts to estimate the dispersion parameter. Because my input files have multiple genes with different ref/alt reads, this seems to adversely affect the estimation performance.

Do you have any suggestions for handling this? Am I correct in saying that the provided pipeline is not suitable for exhaustive pairwise eQTL testing?

Thank you so much for reading. I would really appreciate any input and feedback

gmcvicker commented 4 years ago

For the fit_as_coefficients script you should only include a unique set of target regions (in your case it sounds like TSSs). This script estimates dispersion coefficients for allele-specific expression, which are then used as input for the combined haplotype test.

The combined haplotype test does allow testing of multiple SNPs against a single target region.

Hope that helps clarify,

Graham

ghost commented 4 years ago

Thank you for your reply! What do you mean by TSSs? Given that I have multiple rows of ref/alt counts per gene that depend on the test SNP's phasing, which row should I select to compose the input for fit_as_coefficients?

gmcvicker commented 4 years ago

By TSSs I meant transcription start sites because from your initial inquiry I thought this is what you were using as target regions.

Also after re-reading your original question and thinking about it some more I realize that I made a mistake when I said that the fit_as_coefficients required unique target regions and I should clarify some things: 1) fit_as_coefficients can take a file as input in which a single target region is repeated because there are multiple rows with different 'test' SNPs being tested for association with the target region. 2) The input for fit_as_coefficients should be generated by extract_haplotype_read_counts.py 3) The input file for extract_haplotype_read_counts.py can be generated by your own script, and it is fine to include target regions multiple times (i.e. on different rows). The format is specified in the README. 4) A target region can span multiple discontiguous genome segments (e.g. exons of a gene), but a single target region should not have overlapping segments. For example, if you are using exons of a gene, you should make sure that you are not including duplicated or overlapping exons (e.g. from multiple transcript spliceforms) in the same target region.

ghost commented 4 years ago

Wouldn't repeating the same target region multiple rows affect fit_as_coefficients estimation?

My understanding is that if there were 2 test SNPs (1|0) and (1|1), where 0=ref & 1=alt, and ASE reads 6|3 (also phased), the input file have two rows with the reference and alternate read counts of 3|6 and 0|9. fit_as_coefficients would read both lines and treat 3|6 and 0|9 as separate target region.

Thank you so much for your prompt replies and clarification!

gmcvicker commented 4 years ago

I agree that there will be data duplication due to this issue, however, as long as there is no single target region that is massively overrepresented in the dataset, I don't think that this is a problem.