Open ghost opened 4 years ago
For the fit_as_coefficients script you should only include a unique set of target regions (in your case it sounds like TSSs). This script estimates dispersion coefficients for allele-specific expression, which are then used as input for the combined haplotype test.
The combined haplotype test does allow testing of multiple SNPs against a single target region.
Hope that helps clarify,
Graham
Thank you for your reply! What do you mean by TSSs? Given that I have multiple rows of ref/alt counts per gene that depend on the test SNP's phasing, which row should I select to compose the input for fit_as_coefficients?
By TSSs I meant transcription start sites because from your initial inquiry I thought this is what you were using as target regions.
Also after re-reading your original question and thinking about it some more I realize that I made a mistake when I said that the fit_as_coefficients required unique target regions and I should clarify some things: 1) fit_as_coefficients can take a file as input in which a single target region is repeated because there are multiple rows with different 'test' SNPs being tested for association with the target region. 2) The input for fit_as_coefficients should be generated by extract_haplotype_read_counts.py 3) The input file for extract_haplotype_read_counts.py can be generated by your own script, and it is fine to include target regions multiple times (i.e. on different rows). The format is specified in the README. 4) A target region can span multiple discontiguous genome segments (e.g. exons of a gene), but a single target region should not have overlapping segments. For example, if you are using exons of a gene, you should make sure that you are not including duplicated or overlapping exons (e.g. from multiple transcript spliceforms) in the same target region.
Wouldn't repeating the same target region multiple rows affect fit_as_coefficients estimation?
My understanding is that if there were 2 test SNPs (1|0) and (1|1), where 0=ref & 1=alt, and ASE reads 6|3 (also phased), the input file have two rows with the reference and alternate read counts of 3|6 and 0|9. fit_as_coefficients would read both lines and treat 3|6 and 0|9 as separate target region.
Thank you so much for your prompt replies and clarification!
I agree that there will be data duplication due to this issue, however, as long as there is no single target region that is massively overrepresented in the dataset, I don't think that this is a problem.
Hi,
I am interested in identifying significant eQTLs given n phased SNPs and m genes with allele-specific reads.
I studied the
get_target_regions
script, which doesn't have the option of specifying TSS, so I made my own input files. It contains m x n rows with all possible associations between the test SNPs and genes, and the reference/alternate read counts of the genes depend on the particular row's test SNP.My question is that
fit_as_coefficients
seems to assume that the # of rows = # of genes, or in other words, each gene must unique ref/alt read counts to estimate the dispersion parameter. Because my input files have multiple genes with different ref/alt reads, this seems to adversely affect the estimation performance.Do you have any suggestions for handling this? Am I correct in saying that the provided pipeline is not suitable for exhaustive pairwise eQTL testing?
Thank you so much for reading. I would really appreciate any input and feedback