Closed MosheOlshansky closed 9 years ago
Hi Moshe,
the problem is that Q is not yet able to read paired-end data. Q uses column two in the samfile which contains the 'sam flags'. For single-end data there is 0 (mapped on f-strand), 16 (mapped on r-strand) or 4 (unmapped). For paired-end-data there are different and more flags. See also here:
https://broadinstitute.github.io/picard/explain-flags.html
A possible is workaround is to convert the paired-end tags into single-end tags using samtools:
-> mapped on f-strand samtools view -h -F 0x10 $BAM_IN | samtools view -h -S -F 4 - | awk '{if($1 ~ /^@/){print $0}else{print $1"\t"0"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11}}' > $OUT_PREFIX.sam
-> mapped on r-strand samtools view -h -f 0x10 $BAM_IN | samtools view -h -S -F 4 - | awk '{if($1 !~ /^@/){print $1"\t"16"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11}}' >> $OUT_PREFIX.sam
samtools view -Sb $OUT_PREFIX.sam > $OUT_PREFIX.bam
I will teach Q to read paired-end data as soon as possible.
Best, Peter
Dear Peter,
Thank you very much for the prompt response.
When I fixed the flag in the SAM file Q worked as expected. Since my BAM file only contained reads which were successfully aligned (and passed some quality score threshold) that flag could be either 0 or 16. So below is my script:
samtools view -h IL4.bam | awk 'BEGIN {OFS=""; ORS=""} {if ($0 ~/^@/) {print $0,"\n"} else {for (i=1; i <= NF; i++) {x=$i; if (i==2) {x = 16*(int(x/16)%2)}; print x; if (i < NF) {print "\t"}}; print "\n"}}' > IL4_input.sam
Q works both when I keep only the first mate in each pair or both mates. To me it looks like it works better when I keep both mates - I am getting fewer significant peaks and this better fits the biology.
I also ran it with and without -k option. Once again, in my particular case it looks like I get better results with -k option. To be honest I do not understand the meaning of this option: if your statistic is based on the number of occupied positions, why does it matter if the duplicates are kept or not?
My other question is about the reasonable cutoff, i.e. what would be the reasonable cutoff after which the peaks are not considered reliable and should it be based on p/q - value or saturation score or both?
Finally, when Q knows to accept paired end data, one obvious thing would be to use actual fragments lengths to estimate mean length and variation.
But my question is whether you can do more, i.e. instead of extending each 3' position by the average fragment length you can extend it by it's exact length which you know. How would you define qfrags in such case? You can also try to compute your statistic based on the number of distinct fragments (and not 3' positions). It may be harder than the simple occupancy model but I think that it is doable. You definitely can estimate the distribution of fragments length and approximate it by some convenient distribution if needed. What do you think?
Best regards, Moshe.
From: Peter Hansen [notifications@github.com] Sent: Thursday, September 17, 2015 12:21 AM To: charite/Q Cc: Moshe Olshansky Subject: Re: [Q] Find no enriched regions (#5)
Hi Moshe,
the problem is that Q is not yet able to read paired-end data. Q uses column two in the samfile which contains the 'sam flags'. For single-end data there is 0 (mapped on f-strand), 16 (mapped on r-strand) or 4 (unmapped). For paired-end-data there are different and more flags. See also here:
https://broadinstitute.github.io/picard/explain-flags.html
A possible is workaround is to convert the paired-end tags into single-end tags using samtools:
mapped on f-strand
samtools view -h -F 0x10 $BAM_IN | samtools view -h -S -F 4 - | awk '{if($1 ~ /^@/){print $0}else{print $1"\t"0"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11}}' > $OUT_PREFIX.sam
mapped on r-strand
samtools view -h -f 0x10 $BAM_IN | samtools view -h -S -F 4 - | awk '{if($1 !~ /^@/){print $1"\t"16"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11}}' >> $OUT_PREFIX.sam
samtools view -Sb $OUT_PREFIX.sam > $OUT_PREFIX.bam
I will teach Q to read paired-end data as soon as possible.
Best, Peter
� Reply to this email directly or view it on GitHubhttps://github.com/charite/Q/issues/5#issuecomment-140755951.
"Q works both when I keep only the first mate in each pair or both mates. To me it looks like it works better when I keep both mates - I am getting fewer significant peaks and this better fits the biology."
I think this is because you have twice the data if you keep both mates. There is no reason to discard the second mate and I would keep both.
"I also ran it with and without -k option. Once again, in my particular case it looks like I get better results with -k option. To be honest I do not understand the meaning of this option: if your statistic is based on the number of occupied positions, why does it matter if the duplicates are kept or not?"
The '--keep-duplicate' option should not directly affect the evaluation of predicted binding sites, but indirectly. If you keep the duplicates the qfrags coverage profile will be different and therefore the positions of the predicted binding positions may be different to some extend and also the number of saturated positions in the immediate vicinity.
"My other question is about the reasonable cutoff, i.e. what would be the reasonable cutoff after which the peaks are not considered reliable and should it be based on p/q - value or saturation score or both?"
p- and q-values depend on the different null models they are based on (and on how the background of the particular datasets fit the null model). Null models are different for different peak callers and a predefined cutoff will always be kind of arbitrary. A generally accepted solotion to adress this issue is the framwork of the Irreproducible Discovery Rate (IDR) which was/is extensively used during the ENCODE project and is also discussed in the Q paper.
"Finally, when Q knows to accept paired end data, one obvious thing would be to use actual fragments lengths to estimate mean length and variation."
An automatically generated plot showing the real distribution of fragment length derived from the propper mapped pair would be really a nice feature and I will try to add it as soon as I can.
"But my question is whether you can do more, i.e. instead of extending each 3' position by the average fragment length you can extend it by it's exact length which you know. How would you define qfrags in such case? You can also try to compute your statistic based on the number of distinct fragments (and not 3' positions). It may be harder than the simple occupancy model but I think that it is doable. You definitely can estimate the distribution of fragments length and approximate it by some convenient distribution if needed. What do you think?"
Minor correction: 5' ends are extended towards 3' direction. I also think that the paired-end information could be exploited for prediction and evaluation of peaks, but this would be a different interesting project. I do not have a quick answer on the question if qfrags would be useful for an approach specially designed for paire-end data.
Thank you for your interest and interesting suggestions.
Best, Peter
Dear Peter,
Thank you for your explanations.
I can not use irreproducibility rate since I have only one replicate. I will play with this stuff to see whether having two replicates make enough difference to convince the lab people to do duplicates. So from your experience with the datasets you analysed, what would be a reasonable cutoff (in terms of p-value or q-score)?
I have several datasets with very high duplication rate (in some cases more than 90%). I think that this is due to the fact that they had a low amount of DNA and so a small number of unique fragments which were amplified very much. It wasn't clear to me how to deal with all these duplicated reads and so I also was thinking of looking at the number of occupied positions instead but I never had time to do more than a few simple calculations. So I was really excited to discover that someone else had a similar idea and produced a software package! So now I am trying to run Q on my datasets with hundreds of millions of reads and very high duplication rate. When I use -k (keep-duploicates) option on these datasets I do not get any significant peaks. Below are a few top lines of what I am getting:
"Chromosome" "pos" "pos+1" "q_cov_chip" "q_cov_ctrl" "q_beg_chip" "q_end_chip" "q_beg_ctrl" "q_end_ctrl" "saturation-score" "p-value" "q-value" chr5 30960003 30960004 93738 0 19 25 0 0 0.0873016 -1.0359e-18 -4.0798 chr5 28316897 28316898 123686 0 23 22 0 0 0.0892857 -1.0359e-18 -3.77877 chr5 28317416 28317417 70341 0 20 45 0 0 0.128968 -1.0359e-18 -3.60267 chr5 29435074 29435075 64342 0 17 21 0 0 0.0753968 -1.0359e-18 -3.47774 chr5 29735873 29735874 130491 0 21 23 0 0 0.0873016 -1.0359e-18 -3.38083
and all together I get about 12,000 peaks (instead of 100,000 - I do not change this). Is it due to some bug in the software (can not deal with that many qfrags) or should it be this way? Just in case, please find attached my quality file.
Best regards, Moshe.
From: Peter Hansen [notifications@github.com] Sent: Thursday, September 17, 2015 6:16 PM To: charite/Q Cc: Moshe Olshansky Subject: Re: [Q] Find no enriched regions (#5)
"Q works both when I keep only the first mate in each pair or both mates. To me it looks like it works better when I keep both mates - I am getting fewer significant peaks and this better fits the biology."
I think this is because you have twice the data if you keep both mates. There is no reason to discard the second mate and I would keep both.
"I also ran it with and without -k option. Once again, in my particular case it looks like I get better results with -k option. To be honest I do not understand the meaning of this option: if your statistic is based on the number of occupied positions, why does it matter if the duplicates are kept or not?"
The '--keep-duplicate' option should not directly affect the evaluation of predicted binding sites, but indirectly. If you keep the duplicates the qfrags coverage profile will be different and therefore the positions of the predicted binding positions may be different to some extend and also the number of saturated positions in the immediate vicinity.
"My other question is about the reasonable cutoff, i.e. what would be the reasonable cutoff after which the peaks are not considered reliable and should it be based on p/q - value or saturation score or both?"
p- and q-values depend on the different null models they are based on (and on how the background of the particular datasets fit the null model). Null models are different for different peak callers and a predefined cutoff will always be kind of arbitrary. A generally accepted solotion to adress this issue is the framwork of the Irreproducible Discovery Rate (IDR) which was/is extensively used during the ENCODE project and is also discussed in the Q paper.
"Finally, when Q knows to accept paired end data, one obvious thing would be to use actual fragments lengths to estimate mean length and variation."
An automatically generated plot showing the real distribution of fragment length derived from the propper mapped pair would be really a nice feature and I will try to add it as soon as I can.
"But my question is whether you can do more, i.e. instead of extending each 3' position by the average fragment length you can extend it by it's exact length which you know. How would you define qfrags in such case? You can also try to compute your statistic based on the number of distinct fragments (and not 3' positions). It may be harder than the simple occupancy model but I think that it is doable. You definitely can estimate the distribution of fragments length and approximate it by some convenient distribution if needed. What do you think?"
Minor correction: 5' ends are extended towards 3' direction. I also think that the paired-end information could be exploited for prediction and evaluation of peaks, but this would be a different interesting project. I do not have a quick answer on the question if qfrags would be useful for an approach specially designed for paire-end data.
Thank you for your interest and interesting suggestions.
Best, Peter
— Reply to this email directly or view it on GitHubhttps://github.com/charite/Q/issues/5#issuecomment-141000128.
"I can not use irreproducibility rate since I have only one replicate. I will play with this stuff to see whether having two replicates make enough difference to convince the lab people to do duplicates. So from your experience with the datasets you analysed, what would be a reasonable cutoff (in terms of p-value or q-score)?"
At least you could apply the IDR to pseudo replicates, which can be derived by splitting the dataset randomly in two halves.
"I have several datasets with very high duplication rate (in some cases more than 90%). I think that this is due to the fact that they had a low amount of DNA and so a small number of unique fragments which were amplified very much."
It might be that some experiments failed. How does the plot for binding characteristics look like? Is there only a single peak at one read length (phantom peak)? Then in addition to the high duplication rate the enrichment also did not work properly. Those experiments should be repeated. There is no point in sequencing additional reads, because of the high duplication rate. You would sequence over and over again the same sequences.
"When I use -k (keep-duploicates) option on these datasets I do not get any significant peaks."
If you know that there is a high duplication rate you should run Q without '-k'. Duplicates due to low complexity of the library won't yield additional information.
"all together I get about 12,000 peaks (instead of 100,000 - I do not change this). Is it due to some bug in the software (can not deal with that many qfrags) or should it be this way?"
100,000 is meant to be an upper limit. If there are less regions detected only those will be reported.
"Just in case, please find attached my quality file."
I could not find the attached file. You could compare duplication rate and Q's quality metrics QES and QSS for your data with that of about 400 ENCODE datasets. See here:
http://charite.github.io/Q/tutorial.html#statistics_and_quality_metrics
The version v1.1.0 is able to read paired-end data.
Hi,
I ran Q on two data sets. For one of them I got results which may be reasonable (I need to look at the details) but for the other one (which is actually a much better quality data) I got no peaks at all. If I run Q with default options it estimates the average fragment length to be 1 (it prints 2^32+1) but even when I provided the average fragment length it still has not found a single peak (even though there are hundreds of very reliable ones). The original data was paired ends and I thought that this could have created the problem (even though I do not know why) so I kept only the first mate in every pair (and so the strand is random) but it did not help. Any idea why this happens (and is there any way to avoid this)? I can upload my file (and control - it makes no difference).
Thank you in advance, Moshe.