afombravo / 2FAST2Q

A Python3 program that counts sequence occurrences in raw FASTQ files.
GNU General Public License v3.0
6 stars 2 forks source link

2FAST2Q not counting correctly? #8

Closed th-of closed 7 months ago

th-of commented 7 months ago

I wrote my own program to count the occurrences of sgRNA sequences in fastq files before I found 2FAST2Q. When trying your program, however, the counts I get are different from my own program. After investigating this discrepancy, I suspect something wrong with 2FAST2Q?

Here is an example fastq file: A1_1.zip

And let's consider the guide sequence "TGTAAAGCTCAGTATCTTTT"

If I simply count exact occurences of this sequence in the fastq file using grep and wc

grep "TGTAAAGCTCAGTATCTTTT" A1_1.fastq | wc -l

Gives me the number 118.

I can also verify that the average phred score of each one is above 30, I can also confirm that all occurences are at the expected location (offset) in all cases. I know that 2FAST2Q allows one mismatch, which only means that the count from 2FAST2Q should always be equal to or larger than exact occurences. In the output from 2FAST2Q (default settings), in the table "Compiled.csv" I get a count of 63 for this guide sequence for sample/file A1_1. Why is this? Is this discrepancy reproducible for you/others?

There is also "Minimal upstream sequence Phred-score", I'm not sure what this means but setting it to 0 made to difference.

In fact, the majority of counts are quite far off.

Results from 2FAST2Q: image

Results from my program: image

If I have time I will go through the source code for 2FAST2Q. Versions 2.5.5 and 2.5.6 has the same issue

2FAST2Q version: 2.5.5

Mismatch: 1

Phred Score: 30

Feature Length: 20

Feature start position in the read: 54

Running mode: C

Upstream search sequence: None

Downstream search sequence: None

Mismatches in the upstream search sequence: 0

Mismatches in the downstream search sequence: 0

Minimal Phred-score in the upstream search sequence: 30

Minimal Phred-score in the downstream search sequence: 30

afombravo commented 7 months ago

Hi.

After a quick check i confirm that 2fast2q is working properly. when using the following command:

" python -m fast2q -c --st 54 --m 0 --ph 0 "

I get exactly 118 occurrences of "TGTAAAGCTCAGTATCTTTT"

When using:

" python -m fast2q -c --st 54 --m 0 --ph 30 "

I get the 63 mentioned occurances.

The difference here relates ONLY to the phred score filtering. 2fast2q does NOT use average phred score, it uses single bp phred score, meaning that if even 1 bp is not in the required range, the entire sequence is dropped. This is what seems to be happening here, and the source of all confusion.

2fast2q is optimized for small sequences with single bp variations where mistakes in a single nucleotide can indicate a significant experimental difference, hence this feature.

Cheers, Afonso

P.S. I noticed you openned a github issue. i will post this reply there as well and close this non issue.


De: th-of @.> Enviado: 30 de janeiro de 2024 16:18 Para: afombravo/2FAST2Q @.> Cc: Subscribed @.***> Assunto: [afombravo/2FAST2Q] 2FAST2Q not counting correctly? (Issue #8)

I wrote my own program to count the occurrences of sgRNA sequences in fastq files before I found 2FAST2Q. When trying your program, however, the counts I get are different from my own program. After investigating this discrepancy, I suspect something wrong with 2FAST2Q?

Here is an example fastq file: A1_1.ziphttps://github.com/afombravo/2FAST2Q/files/14100484/A1_1.zip

And let's consider the guide sequence "TGTAAAGCTCAGTATCTTTT"

If I simply count exact occurences of this sequence in the fastq file using grep and wc

grep "TGTAAAGCTCAGTATCTTTT" A1_1.fastq | wc -l

Gives me the number 118.

I can also verify that the average phred score of each one is above 30, I can also confirm that all occurences are at the expected location (offset) in all cases. I know that 2FAST2Q allows one mismatch, which only means that the count from 2FAST2Q should always be equal to or larger than exact occurences. In the output from 2FAST2Q (default settings), in the table "Compiled.csv" I get a count of 63 for this guide sequence for sample/file A1_1. Why is this? Is this discrepancy reproducible for you/others?

There is also "Minimal upstream sequence Phred-score", I'm not sure what this means but setting it to 0 made to difference.

In fact, the majority of counts are quite far off.

Results from 2FAST2Q: image.png (view on web)https://github.com/afombravo/2FAST2Q/assets/46278528/09f4b074-f7f0-4ab2-b302-20a060178f64

Results from my program: image.png (view on web)https://github.com/afombravo/2FAST2Q/assets/46278528/da9e46e1-123a-464f-82c4-ce76912573dc

If I have time I will go through the source code for 2FAST2Q. Versions 2.5.5 and 2.5.6 has the same issue

2FAST2Q version: 2.5.5

Mismatch: 1

Phred Score: 30

Feature Length: 20

Feature start position in the read: 54

Running mode: C

Upstream search sequence: None

Downstream search sequence: None

Mismatches in the upstream search sequence: 0

Mismatches in the downstream search sequence: 0

Minimal Phred-score in the upstream search sequence: 30

Minimal Phred-score in the downstream search sequence: 30

— Reply to this email directly, view it on GitHubhttps://github.com/afombravo/2FAST2Q/issues/8, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AQWQZR74UIHCWSQOBML3XETYREMPFAVCNFSM6AAAAABCRRHPB6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGEYDQMRVGM3DMNI. You are receiving this because you are subscribed to this thread.

th-of commented 7 months ago

Thank you for clarifying the implementation of phred score filtering in 2FAST2Q. Although it's very odd to me that you want to discard counts because of a single low quality base, but then subsequently allow one mismatch. The low quality base is far more likely to be real than a high quality base you know is wrong. The probability of a guide sequence erroneously matching to a read because of a sequencing error is exceedingly low. I can't see a situation where this cause a significant experimental difference.

afombravo commented 7 months ago

Those parameters are just the default that made the most sense for the experimental setup 2FAST2Q was designed for. You are more than free, and encouraged, to change them to whatever you want and/or need.

"The low quality base is far more likely". This is incorrect. Lets take for example a phred score of 10. this means that thare is a 10% chance of a given nucleotide being incorrect. on a 20bp long sequence, this means 2 nucleotides will likely be wrong. This is absolutely not insignificant, and such probabilities altogether in the millions of sequences one can get can sum up to give significant differences. This is of course dataset dependent, but 2fast2q tries to be generalist.

"The probability of a guide sequence erroneously matching to a read because of a sequencing error is exceedingly low." This is also incorrect. take for example a guide that differs from another by 1 or 2 bp. if we assume a minimal phred score of 10 (again), this means those 2 guides could, with some probability, be mistaken by one another, possibly creating errors in the final counting. This differentiation is a very important use case for 2fast2q, one that is actually used by some users.

I don't see the point of this discussion here. Like I said, you are free to easilly change the parameters to whatever you see fit.

Best regards, Afonso


De: th-of @.> Enviado: 31 de janeiro de 2024 08:32 Para: afombravo/2FAST2Q @.> Cc: afombravo @.>; State change @.> Assunto: Re: [afombravo/2FAST2Q] 2FAST2Q not counting correctly? (Issue #8)

Thank you for clarifying the implementation of phred score filtering in 2FAST2Q. Although it's very odd to me that you want to discard counts because of a single low quality base, but then subsequently allow one mismatch. The low quality base is far more likely to be real than a high quality base you know is wrong. The probability of a guide sequence erroneously matching to a read because of a sequencing error is exceedingly low. I can't see a situation where this cause a significant experimental difference.

— Reply to this email directly, view it on GitHubhttps://github.com/afombravo/2FAST2Q/issues/8#issuecomment-1918624188, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AQWQZR6IYSZI32V7VZWTWF3YRH6SVAVCNFSM6AAAAABCRRHPB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJYGYZDIMJYHA. You are receiving this because you modified the open/close state.