Psy-Fer / buttery-eel

The buttery eel - a slow5 guppy/dorado basecaller wrapper
MIT License
34 stars 2 forks source link

Key error - 'mean_qscore' #32

Closed denisbeslic closed 1 month ago

denisbeslic commented 8 months ago

Hello,

I'm trying to basecall some reads I generated with squigulator.

buttery-eel -i results/zymo-human/passed-1K.blow5 -o results/zymo-human/passed-1K-squigulator.fastq -x cuda:all -g resources/ont-guppy-6_5_7/bin/ --config dna_r10.4.1_e8.2_400bps_sup.cfg --port 8020 --slow5_threads 128

When trying to basecall with buttery-eel I get the following error: An exception occurred: KeyError - 'mean_qscore' The fastq file is empty.

The header of the fastq for the input of squigulator looked like this: >81438ac8-1858-44d9-a8de-cec4e7e26adc_parent_read_id=81438ac8-1858-44d9-a8de-cec4e7e26adc_model_version_id=dna_r10.4.1_e8.2_400bps_sup-v4.1.0_mean_qscore=6 I removed the additional information such that the header looks like this: >81438ac8-1858-44d9-a8de-cec4e7e26adc_parent_read_id=81438ac8-1858-44d9-a8de-cec4e7e26adc However, I still get the same error.

Psy-Fer commented 8 months ago

Hey,

Could you please send me a blow5 file with a read that fails? I'll try to reproduce the error.

Cheers, James

denisbeslic commented 8 months ago

Dear James,

thank you for your help. Enclosed is a blow5 file with 500 reads, which throws the error for me. I did some small tests using smaller portions of the same slow5 file (using 'head -10'), which threw no exceptions.

passed-1K.zip

Best, Denis

Psy-Fer commented 8 months ago

Hey Denis,

The issue seems to be the readID in your blow5 file.

for example: S1_333!ebce92f7-c7e7-465d-b946-1af6b575c929_parent_read_id=ebce92f7-c7e7-465d-b946-1af6b575c929_model_version_id=dna_r10.4.1_e8.2_400bps_sup-v4.1.0_mean_qscore=2!0!13!+

If I cut this out and modify it to be a valid UUID ebce92f7-c7e7-465d-b946-1af6b575c929 then the basecalling works fine.

The issue is the read is failing the UUID validation inside dorado/guppy, and that is sending the read back un-basecalled, and therefore now mean_qscore field is present. Given the API for dorado/guppy doesn't really give me many options to handle this, I just let it fail.

Anyway, if you fix your readIDs it should work as intended.

let me know how you get along.

James

Psy-Fer commented 8 months ago

Also just a note about your usage of --slow5_threads 128

It might be helpful to read this document on the thread model design for buttery-eel https://github.com/Psy-Fer/buttery-eel/blob/main/docs/thread_model.md

That should help understand how many --procs and --slow5_threads to use for best results.

denisbeslic commented 8 months ago

Dear James,

thank you for your suggestions and help! I modified the read IDs on my slow5 file. However, I again encountered the 'mean_qscore' error at a later position. I used slow5tools skim function to have a look at the metadata. The read ID looks fine, I think the signal length (=50) of the responsible read was the source of the error. Is there a minimum signal length for guppy/buttery-eel? Would it be possible to just skip the read instead of failing the basecalling? Otherwise, I could also remove reads below a specific length.

Thanks again, Denis

Psy-Fer commented 8 months ago

Hey,

Yea I think I need to re-add my skipped read methods. Previously guppy would just power through, but Dorado seems to be somewhat more sensitive to this kind of thing.

Give me a little time to build the fix and run it through some tests, as well as make a new test for this condition, but manually truncating a read.

I'll get back to you soon with a new release.

Cheers, James

Psy-Fer commented 8 months ago

Hey Denis,

I have a branch called skipper that has the skip logic built into buttery-eel.

https://github.com/Psy-Fer/buttery-eel/tree/skipper

This will detect when dorado doesn't basecall a read, and just returns the read back. It then adds this info to a queue. At the end, it goes through that queue (if not empty), and will write to a skipped_reads.txt file

So testing your passed_1k.blow5 file, I get 3 errors that look like this in the file

read_id stage   error
S1_242!5ebcdb9d-ca72-4ccb-ac1b-729379d92bd5_parent_read_id=5ebcdb9d-ca72-4ccb-ac1b-729379d92bd5_model_version_id=dna_r10.4.1_e8.2_400bps_sup-v4.1.0_mean_qscore=2!0!17!+    stage-1 KeyError-'sequence'
S1_333!ebce92f7-c7e7-465d-b946-1af6b575c929_parent_read_id=ebce92f7-c7e7-465d-b946-1af6b575c929_model_version_id=dna_r10.4.1_e8.2_400bps_sup-v4.1.0_mean_qscore=2!0!13!+    stage-1 KeyError-'sequence'
S1_229!66adcb94-1c97-4b8e-bb6a-d8f704769c37_parent_read_id=66adcb94-1c97-4b8e-bb6a-d8f704769c37_model_version_id=dna_r10.4.1_e8.2_400bps_sup-v4.1.0_mean_qscore=1!0!7!+ stage-1 KeyError-'sequence'

This limits spam in the terminal, and gives you the user, and me the dev, to know what's going on. So this is showing that it couldn't find a sequence. (a bit more informative than the qscore). So you can see it wasn't basecalled. It also tells you which reads. And for me, which stage it failed in the process reads method. So if something else pops up, it is a lot easier for me to troubleshoot.

I need to do a fair bit of testing on this before I merge it cut a release. But for your data, this branch might be a good quick way for you to get your analysis done with this latest version of dorado while I do all the testing.

Let me know if you have any feedback or suggestions.

Cheers, James

Psy-Fer commented 8 months ago

Also i have found out why these 3 are not called. Their signal lengths are less than the minimum threshold required by dorado (100).

these 3 have a signal length of: 95 38 61

So another strategy is to filter those first. I might also add a pre-filter to buttery-eel

James

hasindu2008 commented 8 months ago

Hey @Psy-Fer I think simply skipping them and adding to the skip_list is a good idea. This way, we at least know that some reads actually got skipped because they are short or for whatever reasons. We do not know if this is the only filtering strategy used in Dorado or if they will add things in future. If we simply skip and add to a list, at least it will be future-proof. But one downside is, that the user may get confused/worried when a few reads are skipped, which can be eradicated by documenting this or telling a brief potential reason in the warning.

denisbeslic commented 8 months ago

Hey James, thank you for your work! I think the skip list is a nice approach to track what went wrong. It works fine for my case. As already said, a warning or documentation on the possible source of the error KeyError-'sequence' would be helpful at some point.

Psy-Fer commented 2 months ago

TODO: add skip list and update docs to explain what these do.

NOTE: Latest dorado-server, 7.4.12 changed the 100 signal limit, though unsure what they did.

From their release notes: Very short reads under 100 raw samples no longer cause a crash in ont_basecall_client

Psy-Fer commented 1 month ago

This should now be fixed in https://github.com/Psy-Fer/buttery-eel/releases/tag/v0.5.0

cheers James