barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
137 stars 21 forks source link

ERROR: Seq ID not found in reference sequence files provided for entry #369

Closed BrinleyL closed 2 months ago

BrinleyL commented 3 months ago

I tried to annotate a vcf file(VCFv4.2) from the sequencing center, so I converted them to .gd file using "gdtools VCF2GD" command. Then, I used command like "gdtools COMPARE -r myreference.gbk myvcf.gd". But it produced errors and said it cannot find the seq-id in reference file. I tried "VALIDATE" command. It also produced errors: "ERROR: Seq ID not found in reference sequence files provided for entry". I checked my files and find seq-id was right there. Not sure why it doesn’t work. Below are first few lines of my files. The package version is 0.38.2. Thank you.

myvcf.gd:

Screenshot 2024-04-04 at 3 10 47 PM

myreference.gbk:

Screenshot 2024-04-04 at 3 11 29 PM

errors from VALIDATE:

Screenshot 2024-04-04 at 3 13 52 PM
jeffreybarrick commented 3 months ago

The mismatch is likely between the use of CP000431.1 by whatever program generated the original VCF file and just CP000431 by breseq. Try doing a find/replace with a text file to convert all occurrences to CP000431.

breseq loads the seq_id from GenBank files with this preference: The sequence ID is assigned with this order of preference LOCUS > ACCESSION > VERSION

BrinleyL commented 3 months ago

Thank you, Jeffery! Does "the sequence ID is assigned with this order of preference LOCUS > ACCESSION > VERSION" mean breseq will search down to the VERSION level? If so, VERSION in my genbank file is CP000431.1 which is the same as the VCF file, that's why I assumed breseq can find it. And I previously tried to use --genbank-field-for-seq-id VERSION as the manual suggested to override the default order but it said no such command. Thanks again.

jeffreybarrick commented 2 months ago

Yeah.... unfortunately, it doesn't march through the choices in that order until it finds a match. It searches in that order and stops and assigns it when one of those is present. The --genbank-field-for-seq-id option is only available for the main breseq command... not all the gdtools command variants.

Either of these could be implemented, of course, if someone wanted to add them to the code, but it's probably easiest to just do a search-and-replace on a file generated by a different program/pipeline or maybe you can edit the LOCUS line of the GenBank to use CP000431.1 if it fits/validates when used as an input by breseq.

BrinleyL commented 2 months ago

Hi @jeffreybarrick

Thank you for your clarification! I just edited my reference file at Locus and it works! I provided multiple .gd files with input and the output file showed different percentage for the sample column. Could you tell me what those percentages mean (a figure is attached below)? Thanks.

Screenshot 2024-04-09 at 11 11 28 AM
jeffreybarrick commented 2 months ago

Great!

The 100% means the mutation (names in the row) is present in that sample (named in the column). Blank cells mean 0% (the mutation is not present in that sample).

For some types of analyses (of mixed samples), there can be predictions that are between 0% and 100%, but this does not seem to apply to your input files.

BrinleyL commented 2 months ago

Thank you, Jeffrey! I feel clear now.