Psy-Fer / blue-crab

crab go snap snap
MIT License

ERROR: No enum auxiliary type exists. At src/slow5.c:1458 #9

Open cabbagesofdoom opened 10 months ago

cabbagesofdoom commented 10 months ago

Hi @Psy-Fer,

I am trying to convert some blow5 files to pod5 and get this error:

[slow5_get_aux_enum_labels::ERROR] No enum auxiliary type exists. At src/slow5.c:1458
06-Dec-23 14:45:55 - pyslow5 - [WARNING]: get_aux_enum_labels enum_labels is NULL
06-Dec-23 14:45:55 - pyslow5 - [WARNING]: get_header_value header value not found: ip_address - rg: 0
06-Dec-23 14:45:55 - pyslow5 - [WARNING]: get_header_value header value not found: local_bc_comp_model - rg: 0
06-Dec-23 14:45:55 - pyslow5 - [WARNING]: get_header_value header value not found: mac_address - rg: 0
Traceback (most recent call last):
  File "/srv/scratch/babsgenome/snakes/blow5/tiger/blue-crab-venv/bin/blue-crab", line 8, in <module>
    sys.exit(main())
  File "/srv/scratch/babsgenome/snakes/blow5/tiger/blue-crab-venv/lib/python3.10/site-packages/src/blue_crab.py", line 1529, in main
    slow52pod5(args)
  File "/srv/scratch/babsgenome/snakes/blow5/tiger/blue-crab-venv/lib/python3.10/site-packages/src/blue_crab.py", line 713, in slow52pod5
    s2s_s2p_worker(args, sfile, pod5_out)
  File "/srv/scratch/babsgenome/snakes/blow5/tiger/blue-crab-venv/lib/python3.10/site-packages/src/blue_crab.py", line 1161, in s2s_s2p_worker
    s5_end_reason = slow5_end_reason_labels[read.get("end_reason", 0)]
IndexError: list index out of range

Any ideas of what might cause this and how I might fix it?

Thanks!

Rich

Psy-Fer commented 10 months ago

Oh yea that's my bad.

Let me fix that and get back to you.

Psy-Fer commented 10 months ago

Hey,

Any chance you could show me the header columns of your data?

The first 2 lines above the actual reads (and below the header values)

It should be something like

#char*  uint32_t        double  double  double  double  uint64_t        int16_t*        enum{unknown,partial,mux_change,unblock_mux_change,data_service_unblock_mux_change,signal_positive,signal_negative}     char*   double  int32_t uint8_t uint64_t
#read_id        read_group      digitisation    offset  range   sampling_rate   len_raw_signal  raw_signal      end_reason      channel_number  median_before   read_number     start_mux       start_time

What I'm looking for here is this part of it

enum{unknown,partial,mux_change,unblock_mux_change,data_service_unblock_mux_change,signal_positive,signal_negative}

This is the list that blue-crab tried to get from your slow5 file. If it's not present or fails, it tries to make it a list of just ["unknown"].

It looks like it's trying to use a value that is outside the length of the list. So having a look at the list is a good start to see if there is anything weird going on there.
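The failure mode described above (an index that falls outside the label list) can be guarded against with a bounds check. The following is a minimal sketch only, not blue-crab's actual code; the helper name `resolve_end_reason` and the `DEFAULT_LABELS` constant are hypothetical.

```python
from typing import Optional, Sequence

# Fallback list used when the file carries no end_reason enum (as described above).
DEFAULT_LABELS = ["unknown"]


def resolve_end_reason(labels: Optional[Sequence[str]], index: Optional[int]) -> str:
    """Return the end_reason label for `index`, or "unknown" if it cannot be resolved.

    Hypothetical helper for illustration; it is not part of blue-crab or pyslow5.
    """
    # No enum in the file: fall back to the single-label list.
    if not labels:
        labels = DEFAULT_LABELS
    # A missing value or an out-of-range index would otherwise raise an exception.
    if index is None or not 0 <= index < len(labels):
        return "unknown"
    return labels[index]


print(resolve_end_reason(None, 5))
print(resolve_end_reason(["unknown", "partial"], 1))
```

With this kind of check, a file without the enum converts with every read marked "unknown" instead of raising IndexError.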

An easy way to get that value from a blow5 file is to run this command

slow5tools view reads.blow5 | less

and just scroll down to that header line and copy paste it here.
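If copying the line by hand is awkward, the enum labels can also be pulled out of the column-type line with a few lines of Python. This is a rough sketch that assumes the type line has already been read as text (for example from the slow5tools view output); the function name `enum_labels_from_type_line` is made up for this example.

```python
import re
from typing import List, Optional


def enum_labels_from_type_line(type_line: str) -> Optional[List[str]]:
    """Extract the labels of the first enum{...} column in a SLOW5 type line.

    Returns None when the line contains no enum column at all.
    Hypothetical helper, not part of slow5tools or pyslow5.
    """
    match = re.search(r"enum\{([^}]*)\}", type_line)
    if match is None:
        return None
    # Labels are comma-separated inside the braces; drop empty entries.
    return [label for label in match.group(1).split(",") if label]


example = "#char*\tuint32_t\tint16_t*\tenum{unknown,partial,mux_change}\tchar*"
print(enum_labels_from_type_line(example))
print(enum_labels_from_type_line("#char*\tuint32_t\tint16_t*\tchar*"))
```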

Thanks James

Psy-Fer commented 10 months ago

I have also just pushed a change to the dev branch that adds a check on this line of code. If it fails, it prints what slow5_end_reason_labels is set to, as a quick way to troubleshoot.

So another way is to switch to the dev branch, run pip install . again then try running the same conversion again and wait for it to hit the same error.

Thanks James

killidude commented 6 months ago

Hi @Psy-Fer,

I am also getting the same error:

blue-crab s2p minion_sim_1000_itrs.blow5 -o minion_test.pod5
05-Apr-24 19:03:38 - blue-crab - [INFO]: single2single: 1 s/blow5 file detected as input. Writing 1:1 s/blow5->pod5 to file: minion_test.pod5
05-Apr-24 19:03:38 - blue-crab - [INFO]: Opening s/blow5 file: minion_sim_1000_itrs.blow5
[slow5_get_aux_enum_labels::ERROR] No enum auxiliary type exists. At src/slow5.c:1458
05-Apr-24 19:03:38 - pyslow5 - [WARNING]: get_aux_enum_labels enum_labels is NULL
Traceback (most recent call last):
  File "/home/tomas/.local/bin/blue-crab", line 8, in <module>
    sys.exit(main())
  File "/home/tomas/.local/lib/python3.10/site-packages/src/blue_crab.py", line 1529, in main
    slow52pod5(args)
  File "/home/tomas/.local/lib/python3.10/site-packages/src/blue_crab.py", line 713, in slow52pod5
    s2s_s2p_worker(args, sfile, pod5_out)
  File "/home/tomas/.local/lib/python3.10/site-packages/src/blue_crab.py", line 1161, in s2s_s2p_worker
    s5_end_reason = slow5_end_reason_labels[read.get("end_reason", 0)]
IndexError: list index out of range

The last two lines before the actual reads are:

#char* uint32_t double double double double uint64_t int16_t* char* double int32_t uint8_t uint64_t
#read_id read_group digitisation offset range sampling_rate len_raw_signal raw_signal channel_number median_before read_number start_mux start_time

There is no enum{} in my files.

The slow5 files were generated (using the subprocess.run function of python) with the dna-r10-min model and full-contigs:

"squigulator " + fastaderep + " -x dna-r10-min -o ./tmp/tmp" + str(i) + ".slow5 --full-contig --seed " + str(random_numbers[i])

and then merged:

slow5tools merge tmp -o minion_sim_1000_itrs.slow5
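For reference, the squigulator call above can also be built as an argument list instead of a concatenated string, which avoids quoting problems with subprocess.run. This is only a sketch: the flags are the ones quoted in this comment, while the function name `squigulator_cmd` and the example file names are hypothetical.

```python
import shlex
from typing import List


def squigulator_cmd(fasta: str, out_slow5: str, seed: int,
                    profile: str = "dna-r10-min") -> List[str]:
    """Build the squigulator argument list used in this thread.

    Hypothetical helper; only the flags shown in the comment above are used.
    """
    return [
        "squigulator", fasta,
        "-x", profile,
        "-o", out_slow5,
        "--full-contig",
        "--seed", str(seed),
    ]


cmd = squigulator_cmd("ref.fasta", "./tmp/tmp0.slow5", seed=42)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To actually run it (requires squigulator on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```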

The individual tmp files as well as the merged files have the same structure and no enum{} on line 9.

I tried the dna-r9-min model, and there also is no enum{} on line 9.

#char* uint32_t double double double double uint64_t int16_t* char* double int32_t uint8_t uint64_t
#read_id read_group digitisation offset range sampling_rate len_raw_signal raw_signal channel_number median_before read_number start_mux start_time

Thanks,

Tomas

Psy-Fer commented 6 months ago

Ahh so these reads were built with squigulator?

I'll need to tell @hasindu2008 to put a dummy end_reason in the blow5 output.

In the meantime, I'll modify blue-crab to insert a dummy enum via an argument, making all reads end in the signal_positive state.

I'll get back to you in a sec

James

Psy-Fer commented 6 months ago

Hi Tomas,

Could you please try using the dev branch and showing me the error it gives you?

You can do this by activating your environment. If you installed with pip from PyPI, please clone the blue-crab repo:

git clone git@github.com:Psy-Fer/blue-crab.git

Then go into the blue-crab directory and run git pull, then git switch dev. You can check it worked by running git status, which should say something like

On branch dev
Your branch is up to date with 'origin/dev'.

Then re-install this dev version into your env

pip install .

Now re-run your blue-crab command.

Something fishy is going on, but this should help figure it out.

Cheers, James

killidude commented 6 months ago

Hi James,

Thanks for looking into this.

Here is the output of the dev branch:

blue-crab s2p minion_sim_1000_itrs.blow5 -o minion_test.pod5
10-Apr-24 09:04:21 - blue-crab - [INFO]: single2single: 1 s/blow5 file detected as input. Writing 1:1 s/blow5->pod5 to file: minion_test.pod5
10-Apr-24 09:04:21 - blue-crab - [INFO]: Opening s/blow5 file: minion_sim_1000_itrs.blow5
[slow5_get_aux_enum_labels::ERROR] No enum auxiliary type exists. At src/slow5.c:1458
10-Apr-24 09:04:21 - pyslow5 - [WARNING]: get_aux_enum_labels enum_labels is NULL
Traceback (most recent call last):
  File "/home/tomas/.local/bin/blue-crab", line 8, in <module>
    sys.exit(main())
  File "/home/tomas/.local/lib/python3.10/site-packages/src/blue_crab.py", line 1558, in main
    slow52pod5(args)
  File "/home/tomas/.local/lib/python3.10/site-packages/src/blue_crab.py", line 713, in slow52pod5
    s2s_s2p_worker(args, sfile, pod5_out)
  File "/home/tomas/.local/lib/python3.10/site-packages/src/blue_crab.py", line 1364, in s2s_s2p_worker
    read_id=uuid.UUID(read["read_id"]),
  File "/usr/lib/python3.10/uuid.py", line 177, in __init__
    raise ValueError('badly formed hexadecimal UUID string')
ValueError: badly formed hexadecimal UUID string

Psy-Fer commented 6 months ago

Ahh progress!

Okay so now the issue is the readID isn't a valid uuid. Again I think that's a squigulator issue.

@hasindu2008 what are the readIDs you make?

The issue here is that pod5 requires the readID to be a uuid. So I can't just use any old string.

Ideally squigulator would create these and then blue-crab just reads the string and converts it.

Another option is in the absence of valid uuids I add an option to create one. But then you can't link the old reads to the new reads (unless I make a tsv file that provides the mapping).

What do you think?
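The second option above (generate a UUID when the read ID is not one, and keep a TSV mapping) could look roughly like the sketch below. It is not what blue-crab or squigulator implements. It uses uuid.uuid5 so the generated IDs are deterministic for a given read ID; the namespace, function names, and TSV column names are all assumptions made for this example.

```python
import csv
import io
import uuid
from typing import Dict, Iterable, TextIO

# Hypothetical namespace, chosen only for this example.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def is_valid_uuid(read_id: str) -> bool:
    """True if uuid.UUID() accepts the string."""
    try:
        uuid.UUID(read_id)
    except ValueError:
        return False
    return True


def map_read_ids(read_ids: Iterable[str]) -> Dict[str, uuid.UUID]:
    """Map each read ID to a UUID, deriving one deterministically if needed."""
    mapping: Dict[str, uuid.UUID] = {}
    for read_id in read_ids:
        if is_valid_uuid(read_id):
            mapping[read_id] = uuid.UUID(read_id)
        else:
            # Same input string always yields the same UUID.
            mapping[read_id] = uuid.uuid5(EXAMPLE_NAMESPACE, read_id)
    return mapping


def write_mapping_tsv(mapping: Dict[str, uuid.UUID], handle: TextIO) -> None:
    """Write original read ID and UUID as two tab-separated columns."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["original_read_id", "uuid"])
    for original, new_id in mapping.items():
        writer.writerow([original, str(new_id)])


buffer = io.StringIO()
write_mapping_tsv(map_read_ids(["sim_read_1", "sim_read_2"]), buffer)
print(buffer.getvalue())
```

The TSV is what would let the converted reads be linked back to the original read IDs.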

killidude commented 6 months ago

James,

I agree, the solution is to have valid UUIDs and a dummy end_reason in the slow5/blow5 output generated by squigulator @hasindu2008.

Not having this also most likely breaks the buttery-eel wrapper.

Ultimately, I need to be able to basecall the simulated slow5/blow5 files generated by squigulator so I can use the called fastq files for downstream analyses.

Thanks,

Tomas

Psy-Fer commented 6 months ago

Buttery-eel I can unbreak by using dummy uuids when i basecall and then replace the original readID when the read comes back.

The issue is that when going over to pod5 you can't do this, because of their strict typing. So yea, either squigulator produces uuids or I create them in blue-crab and provide a file that maps squigulator readIDs to uuids.

Let's see what @hasindu2008 thinks and then we will implement it asap

James

hasindu2008 commented 6 months ago

Hey all,

The reason I adhere to the current readID format in squigulator is so that it is compatible with the "mapeval" utility in Minimap2's Paftools companion script. This is quite useful for assessing the mapping accuracy once the reads are basecalled. Also, I like deterministic read IDs compared to random ones.

It is very strange that POD5 needs the readID to be a UUID. Perhaps in their implementation they simply store the UUID as a 128-bit integer instead of as a variable-length string. This is not great, as it means POD5 is stuck with UUIDs as read IDs forever; changing that later would break backward compatibility. The read ID in many bioinformatics formats, including BAM, has been a variable-length string.

Perhaps I can implement an option in squigulator called --ont-friendly that produces fake UUIDs for the read IDs, as well as a fake end_reason with the value "unknown". Let me know your thoughts on this. This way, there is no need for blue-crab to do any "UUIDification" of the readIDs. If you are all happy, I can implement this in squigulator ASAP.

By the way, @Psy-Fer, is this UUID thing applicable to buttery-eel too? It wasn't a problem when using ont-guppy-server with the eel. Perhaps they enforced this UUID in ont-dorado-server? If they have enforced it (which makes little sense to me), I would be very glad if you could do some internal mapping with a fake uuid when sending to the ont-basecall-server, but write the original readID to the FASTQ/SAM.

hasindu2008 commented 6 months ago

Also cross-referencing to the issue in squigulator that raises the same issue: https://github.com/hasindu2008/squigulator/issues/13

Psy-Fer commented 5 months ago

Hey,

Okay I'll just make absolutely sure what pod5 is doing so we are 100% correct when we do this.

James

killidude commented 5 months ago

@hasindu2008 and @Psy-Fer

Perhaps I (@hasindu2008) can implement an option in squigulator called --ont-friendly that produces fake UUIDs for the read IDs, as well as a fake end_reason with the value "unknown". Let me know your thoughts on this. This way, there is no need for blue-crab to do any "UUIDification" of the readIDs. If you are all happy, I can implement this in squigulator ASAP.

I think this is a great solution that will maintain maximum compatibility for downstream use.

Thanks,

Tomas

Psy-Fer commented 5 months ago

Okay I have confirmed that pod5 requires a uuid type for the readID, even though it shouldn't have to be.

--s2p--
verbose=1
-------------------blue-crab version-------------------
SLOW5/BLOW5 <-> POD5 converter version: 0.1.0

-------------------testcase:1: .slow5 to .pod5-------------------
12-Apr-24 17:35:30 - blue-crab - [INFO]: single2single: 1 s/blow5 file detected as input. Writing 1:1 s/blow5->pod5 to file: ./test//data/out/s2p/a.pod5
12-Apr-24 17:35:30 - blue-crab - [INFO]: Opening s/blow5 file: ./test//data/raw/s2p/a.slow5
12-Apr-24 17:35:30 - pyslow5 - [WARNING]: get_header_value header value not found: ip_address - rg: 0
12-Apr-24 17:35:30 - pyslow5 - [WARNING]: get_header_value header value not found: mac_address - rg: 0
Traceback (most recent call last):
  File "/home/jamfer/pvenv/blue-crab-test/bin/blue-crab", line 8, in <module>
    sys.exit(main())
  File "/home/jamfer/pvenv/blue-crab-test/lib/python3.8/site-packages/src/blue_crab.py", line 1561, in main
    slow52pod5(args)
  File "/home/jamfer/pvenv/blue-crab-test/lib/python3.8/site-packages/src/blue_crab.py", line 713, in slow52pod5
    s2s_s2p_worker(args, sfile, pod5_out)
  File "/home/jamfer/pvenv/blue-crab-test/lib/python3.8/site-packages/src/blue_crab.py", line 1392, in s2s_s2p_worker
    writer.add_read(read)
  File "/home/jamfer/pvenv/blue-crab-test/lib/python3.8/site-packages/pod5/writer.py", line 256, in add_read
    self.add_reads([read])
  File "/home/jamfer/pvenv/blue-crab-test/lib/python3.8/site-packages/pod5/writer.py", line 292, in add_reads
    *self._prepare_add_reads_args(reads),
  File "/home/jamfer/pvenv/blue-crab-test/lib/python3.8/site-packages/pod5/writer.py", line 306, in _prepare_add_reads_args
    [np.frombuffer(read.read_id.bytes, dtype=np.uint8) for read in reads]
  File "/home/jamfer/pvenv/blue-crab-test/lib/python3.8/site-packages/pod5/writer.py", line 306, in <listcomp>
    [np.frombuffer(read.read_id.bytes, dtype=np.uint8) for read in reads]
AttributeError: 'str' object has no attribute 'bytes'
testcase 1 failed

This is what happens if we just pass a str.

It's trying to access the bytes attribute of the uuid type specifically, as that is what they expect.
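To make the traceback concrete: a uuid.UUID object carries a 16-byte .bytes attribute, while a plain str does not, which is why the writer fails. A small standard-library sketch (the helper name `as_uuid` is hypothetical and is not part of pod5 or blue-crab):

```python
import uuid
from typing import Union


def as_uuid(read_id: Union[str, uuid.UUID]) -> uuid.UUID:
    """Return a uuid.UUID, converting from str when needed.

    Raises ValueError if the string is not a well-formed UUID.
    """
    if isinstance(read_id, uuid.UUID):
        return read_id
    return uuid.UUID(read_id)


rid = as_uuid("00000000-0000-0000-0000-000000000001")
raw = rid.bytes  # the 16 raw bytes that a UUID object exposes
print(len(raw), hasattr("some read id", "bytes"))
```

Converting at the boundary like this turns the AttributeError into an earlier and clearer ValueError for read IDs that are not UUIDs.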

So yea, I think we need to go with dummy uuids, and just make a tsv file that maps the uuid with the more verbose read information you want to store.

James

hasindu2008 commented 5 months ago

@Psy-Fer I am implementing an option in squigulator to generate uuids for readids, so blue-crab does not need to do anything.

Please check if the buttery-eel is also broken due to this uuid thing?

Psy-Fer commented 5 months ago

Buttery-eel should be fine, unless they change something in the dorado server code. See https://github.com/Psy-Fer/buttery-eel/issues/32: I used to think it was an issue, but it turned out to be just a change in how dorado-server handles reads that are too short.

Psy-Fer commented 5 months ago

I should probably merge the buttery-eel/skipped branch into main and do a release to handle this.

hasindu2008 commented 5 months ago

@killidude

If you compile squigulator from the dev branch, and specify the option --ont-friendly=yes it should be pod5 conversion compatible.

When you specify --ont-friendly=yes it will add a dummy end_reason and create fake UUIDs for the read IDs.

If you encounter issues let me know, thanks.

Seems like buttery-eel works even without things being uuid as James mentioned above.

killidude commented 5 months ago

@hasindu2008,

Thanks for implementing this option. I can now convert the squigulator generated files to pod5.

Thanks for your help,

Tomas

denisbeslic commented 3 days ago

Hi @Psy-Fer, I'm using squigulator (v0.4.0) with the --ont-friendly=yes parameter and blue-crab (v0.2.0). The error occurs during the conversion of a squigulator .slow5 file to .pod5. Here's the error traceback:

04-Oct-24 16:12:25 - blue-crab - [INFO]: single2single: 1 s/blow5 file detected as input. Writing 1:1 s/blow5->pod5 to file: test.pod5
04-Oct-24 16:12:25 - blue-crab - [INFO]: Opening s/blow5 file: squigulator_reads.slow5
Traceback (most recent call last):
  File "/X.local/bin/blue-crab", line 8, in <module>
    sys.exit(main())
  File "/X/.local/lib/python3.8/site-packages/src/blue_crab.py", line 1562, in main
    slow52pod5(args)
  File "/X/.local/lib/python3.8/site-packages/src/blue_crab.py", line 717, in slow52pod5
    s2s_s2p_worker(args, sfile, pod5_out)
  File "/X/.local/lib/python3.8/site-packages/src/blue_crab.py", line 1195, in s2s_s2p_worker
    reason, forced = s2p_end_reason_convert(s5_end_reason)
  File "/X/.local/lib/python3.8/site-packages/src/blue_crab.py", line 94, in s2p_end_reason_convert
    "api_request": (p5.EndReasonEnum.API_REQUEST, False),
  File "/usr/lib/python3.8/enum.py", line 384, in __getattr__
    raise AttributeError(name) from None
AttributeError: API_REQUEST

I suspect this issue might be related to a recent pull request based on the new pod5 spec from about a month ago. Is there a way to avoid this error?

Psy-Fer commented 3 days ago

Hmm... can you make sure you have the latest pod5 version?

Which version do you have? Could you run pip list for me?

denisbeslic commented 3 days ago

Thank you for the fast answer, upgrading pod5 fixed the problem!
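For anyone hitting the same AttributeError, the version check suggested above can also be done from Python. This is a minimal sketch using only the standard library; `ExampleEndReason` is a stand-in enum defined here for illustration and is not the real pod5 EndReasonEnum, and the helper names are hypothetical.

```python
import enum
from importlib import metadata
from typing import Optional, Type


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_member(enum_cls: Type[enum.Enum], name: str) -> bool:
    """Check for an enum member by name without triggering AttributeError."""
    return name in enum_cls.__members__


class ExampleEndReason(enum.Enum):
    """Stand-in for an older enum that lacks newer members (illustration only)."""
    UNKNOWN = 0
    SIGNAL_POSITIVE = 1


print("pod5 version:", installed_version("pod5"))
print("has API_REQUEST:", has_member(ExampleEndReason, "API_REQUEST"))
```

If the installed pod5 predates the end reasons that blue-crab refers to, upgrading pod5 is the fix, as confirmed above.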