fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing

"--output-rnames" fails to escape ; in rnames breaking VCF when used with ONT duplex reads #403

Open mp15 opened 1 year ago

mp15 commented 1 year ago

ONT duplex reads have RNAMEs of the form <read name>;<read name>. When you use the --output-rnames flag, Sniffles writes these names into the VCF without escaping the ;, so VCF parsers treat the second read name as another INFO field. I'd recommend escaping each ; in a read name, perhaps as %3B, as documented in section 1.2 of the VCF 4.3 spec.
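A minimal sketch of what that escaping could look like, assuming the read names end up in a comma-separated INFO value. The helper names, the RNAMES key as used here, and the placeholder read names are illustrative assumptions, not Sniffles' actual code; the mapping covers the special characters listed in section 1.2 of the VCF 4.3 spec.

```python
# Percent-encodings for characters with special meaning in VCF fields
# (VCF 4.3 spec, section 1.2).
VCF_ESCAPES = {
    "%": "%25",
    ":": "%3A",
    ";": "%3B",
    "=": "%3D",
    ",": "%2C",
    "\r": "%0D",
    "\n": "%0A",
    "\t": "%09",
}


def escape_vcf_value(value: str) -> str:
    """Percent-encode characters that would otherwise break an INFO value."""
    # Each character is mapped exactly once, so a literal "%" is never double-encoded.
    return "".join(VCF_ESCAPES.get(ch, ch) for ch in value)


def format_rnames(read_names):
    """Hypothetical helper: build an RNAMES INFO entry from raw read names."""
    return "RNAMES=" + ",".join(escape_vcf_value(name) for name in read_names)


# "readA;readB" stands in for a duplex read name; real names are much longer.
info = "SVTYPE=DEL;" + format_rnames(["readA;readB", "readC"])
print(info)  # SVTYPE=DEL;RNAMES=readA%3BreadB,readC
```

With the ; encoded, splitting the INFO column on ; yields two fields (SVTYPE and RNAMES) instead of three.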

fritzsedlazeck commented 1 year ago

Uh, that's something new... I'm not sure how to handle this: if we replace this symbol, the read names will no longer be traceable. I will reach out to Nanopore and ask. Thanks, Fritz

wdecoster commented 1 year ago

Hmm, not great, but you could consider replacing all ; in read names with e.g. +. Ideally ONT would have picked a different separator, but that ship has sailed, probably.

mp15 commented 1 year ago

The thing is, the VCF specification already defines a way to escape these characters: the % encoding I mentioned above. There's no need to invent a new one. However, I'm not sure this should necessarily be done at the Sniffles level. I'd suggest that the escaping and unescaping be done transparently at the pysam VCF layer, so that libraries using it can just treat strings as strings.
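To illustrate why % encoding keeps read names traceable, here is a rough sketch of the decoding side using only the Python standard library. The parse_info function and the example INFO string are hypothetical and do not reflect how pysam or Sniffles parse records.

```python
from urllib.parse import unquote


def parse_info(info: str) -> dict:
    """Split a VCF INFO string and percent-decode each value (illustrative only)."""
    fields = {}
    for item in info.split(";"):
        key, sep, raw = item.partition("=")
        if sep:
            # Split on "," first, then decode, so an encoded %2C stays inside its value.
            fields[key] = [unquote(value) for value in raw.split(",")]
        else:
            # Flag-type INFO entries carry no value.
            fields[key] = True
    return fields


# Placeholder record: the duplex read name was written as readA%3BreadB.
parsed = parse_info("PRECISE;SVTYPE=DEL;RNAMES=readA%3BreadB,readC")
print(parsed["RNAMES"])  # ['readA;readB', 'readC']
```

Decoding restores the original ; in the duplex read name, so the name can still be matched against the BAM after a round trip through the VCF.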

mp15 commented 1 year ago

Actually scratch that, it looks like you're doing a lot of raw VCF handling yourselves so you might need to do the escaping too?

fritzsedlazeck commented 1 year ago

Yeah. Honestly, I just talked to ONT and they might change it. I understand the escape character, but I'm not sure how IGV or other tools would handle it... Thanks, Fritz