InvalidReferenceSequenceName: Exception due to brackets in reference name

abdenlab / oxbow

Read specialized NGS formats as data frames in R, Python, and more.

https://lifeinbytes.substack.com/p/breaking-out-of-bioinformatic-data-silos

Apache License 2.0


InvalidReferenceSequenceName: Exception due to brackets in reference name #47

Closed: JackCurragh closed this issue 1 year ago

JackCurragh commented 1 year ago

Hello,

I have been using oxbow quite a lot since that initial blog post and have just run into a new issue that I am hoping there may be a workaround for. It occurs when I try to read a BAM file aligned to the SacCer genome with STAR, because the tRNA reference names contain brackets.

E.g., the Ensembl annotation has:

tI(AAU)I1_tRNA-E1

I assume these brackets are the root of the issue. Is there any chance that this could be handled within oxbow? Or is it a limitation imposed by Arrow?

Thanks in advance.

GarrettNg commented 1 year ago

Hi Jack,

Thanks for your interest in oxbow.

The InvalidReferenceSequenceName error is coming from our upstream dependency, noodles, which reads and parses the BAM file. You're right that the brackets in the reference name are causing the issue. The name parsing happens in noodles here. noodles aims for compliance with the file format specs (SAM spec, page 3, section 1.2.1), which do not allow brackets in reference sequence names, so it errors out on names like this.

Unfortunately, we don't have a way of handling nonconformant data with oxbow at the moment.
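For anyone who wants to check their reference names before reading a file, here is a minimal sketch of the rule in section 1.2.1 of the SAM spec. It is not the noodles implementation: the pattern is transcribed by hand from the spec, and the function names are made up for illustration.

```python
import re

# Character classes transcribed from the SAM spec, section 1.2.1.
# A name may not start with '*' or '=', and may not contain
# backslashes, commas, quotation marks, or brackets of any kind.
FIRST_CHAR = r"[0-9A-Za-z!#$%&+./:;?@^_|~-]"
REST_CHAR = r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]"

RNAME_PATTERN = re.compile(FIRST_CHAR + REST_CHAR + "*")
REST_CHAR_PATTERN = re.compile(REST_CHAR)


def is_valid_reference_name(name: str) -> bool:
    """Hypothetical helper: True if the whole name matches the spec pattern."""
    return RNAME_PATTERN.fullmatch(name) is not None


def offending_characters(name: str) -> list[str]:
    """Hypothetical helper: characters that are not allowed anywhere in a name."""
    return sorted({c for c in name if not REST_CHAR_PATTERN.fullmatch(c)})


if __name__ == "__main__":
    name = "tI(AAU)I1_tRNA-E1"
    print(is_valid_reference_name(name))
    print(offending_characters(name))
```

Running this on the name from the report should flag the two parentheses as the offending characters.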

JackCurragh commented 1 year ago

Ah yes, that makes a lot more sense than Arrow. Not sure what I was thinking when I wrote that!

Thank you for your response. I will have to find a way to sanitise the BAMs beforehand, I guess!
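One possible way to do that sanitising is to rewrite the `SN:` fields of the `@SQ` lines in the SAM header text. This is only a rough sketch under a few assumptions: the header has already been exported as text, disallowed characters are replaced with an underscore, and the function names are hypothetical. Writing the cleaned header back into the BAM is left to a separate tool.

```python
import re

# Anything outside the characters the SAM spec (section 1.2.1) allows in a name.
DISALLOWED = re.compile(r"[^0-9A-Za-z!#$%&*+./:;=?@^_|~-]")


def sanitize_name(name: str, replacement: str = "_") -> str:
    """Hypothetical helper: replace disallowed characters in one reference name."""
    cleaned = DISALLOWED.sub(replacement, name)
    # The spec also forbids a leading '*' or '='.
    if cleaned[:1] in ("*", "="):
        cleaned = replacement + cleaned[1:]
    return cleaned


def sanitize_header(header_text: str) -> str:
    """Hypothetical helper: sanitise the SN: field of every @SQ header line."""
    seen = set()
    out = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            fields = line.split("\t")
            for i, field in enumerate(fields):
                if field.startswith("SN:"):
                    new_name = sanitize_name(field[3:])
                    # Two different names must not collapse into the same one.
                    if new_name in seen:
                        raise ValueError(f"name collision after sanitising: {new_name}")
                    seen.add(new_name)
                    fields[i] = "SN:" + new_name
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    header = "@HD\tVN:1.6\n@SQ\tSN:tI(AAU)I1_tRNA-E1\tLN:72\n"
    print(sanitize_header(header), end="")
```

Note that this only touches the header. Reference names stored as text elsewhere, for example inside optional tags such as `SA`, would not be updated by it.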