fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License
8 stars 1 forks source link

Reject contigs with names that would break bcftools #65

Open fedarko opened 1 year ago

fedarko commented 1 year ago

I'm currently working with a contig named "160+,38-,88-,106-,51+,23+,92-,34+,158-,167+,41+,131+,81+,58+,102+,192-(circular)". I was able to align reads to this contig and perform r-mutation calling in it; however, when I get to the part of strainFlye call r-mutation that attempts to convert the resulting VCF file to a BCF file, I get the following error:

...
call r-mutation @ 1,989.53s: Converting the VCF file to a compressed BCF file...
[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=160+,38-,88-,106-,51+,23+,92-,34+,158-,167+,41+,131+,81+,58+,102+,192-(circular),length=388097>"
...

I'm pretty sure the commas and parentheses are the kicker here. The VCF v4.3 spec confirms this:

Contig names follow the same rules as the SAM format’s reference sequence names: they may contain any printable ASCII characters in the range [!-~] apart from ‘\ , "‘’ () [] {} <>’ and may not start with ‘*’ or ‘=’. Thus they match the following regular expression:

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

In particular, excluding commas facilitates parsing ##contig lines, and excluding the characters ‘<>[]’ and initial ‘*’ avoids clashes with symbolic alleles. The contig names must not use a reserved symbolic allele name.

The obvious solution here is thus updating strainFlye's FASTA-reading function (?) to reject badly-named contigs like this. We could maybe make this check only required for strainFlye commands that will use bcftools or something, but it's less of a headache (and probably good practice) to just apply this check to every time we read a FASTA file.

This is a similar idea to #57 -- however, addressing that issue in its entirety would require some extra work. (For example: The regex given by the VCF spec apparently allows contigs with / characters in their names, but I have to imagine that if your contigs actually have / in their names then a swarm of locusts will devour your terminal or something when you start trying to use these names in files.)