Non-intuitive WARNING + FATAL ERROR when using multiple references with same-name LOCUS definitions

dannagifford commented 2 years ago

Hello,

Description of problem

I am using breseq 0.37 with two reference GBK files, each with multiple LOCUS entries. Each file contains a ~5 Mb chromosome and 1-5 plasmids of varying lengths. breseq can take multiple reference sequences, and I've used that feature successfully in the past. In this case I'm trying to look for HGT (which might not work for other reasons, see #314) between two different reference genomes annotated by Prokka. However, when the two GBK files contain LOCUS definitions with the same name, breseq throws an error because it can't work out how long each LOCUS should be.

For example, consider two files, WS25.gbk and WS89.gbk. The LOCUS definitions in each file are numbered sequentially, starting at '1'. These are Prokka outputs.

WS25.gbk contains 3 LOCUS definitions:

LOCUS 1 5273430 bp DNA linear 02-JUL-2021
LOCUS 2 134084 bp DNA linear 02-JUL-2021
LOCUS 3 441 bp DNA linear 02-JUL-2021

WS89.gbk contains 5 LOCUS definitions:

LOCUS 1 4954463 bp DNA linear 02-JUL-2021
LOCUS 2 134158 bp DNA linear 02-JUL-2021
LOCUS 3 41012 bp DNA linear 02-JUL-2021
LOCUS 4 34653 bp DNA linear 02-JUL-2021
LOCUS 5 5167 bp DNA linear 02-JUL-2021

breseq command:

breseq -k -j 6 -r WS25.gbk -r WS89.gbk -o 8925T3_2/. *.gz

It first warns

----------------------------------> WARNING <-----------------------------------
Length assigned to sequence '3' from LOCUS line (441) does not match length from source feature (4954463). Length from LOCUS line will be used. If you encounter further errors, make sure this length matches the true length of your sequence.

Then it fails with:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Error in reference file. Feature of type [CDS] named [] on sequence [3] has coordinates (70-1404) that are outside of the sequence bounds (1-441).
FILE: reference_sequence.cpp   LINE: 1849
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

i.e. it seems to confuse which CDS features belong to which LOCUS.

Expected behaviour

If breseq receives multiple same-name LOCUS definitions, perhaps it should stop with an error saying so.

Workaround

Rename each LOCUS definition to a unique name.
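Before renaming anything, it can help to confirm which LOCUS names actually collide. The following is a minimal sketch of such a check, not part of breseq: the function names `locus_names` and `find_duplicate_loci` are made up for illustration, and it assumes the sequence name is the first whitespace-separated token after the LOCUS keyword, as in the Prokka output shown above.

```python
import re
from collections import defaultdict

# A LOCUS line starts with the keyword, then whitespace, then the sequence name.
# Assumption: the name is the first token after "LOCUS" (true for the lines above).
LOCUS_RE = re.compile(r"^LOCUS\s+(\S+)")


def locus_names(genbank_text):
    """Return the LOCUS names found in GenBank-formatted text, in file order."""
    names = []
    for line in genbank_text.splitlines():
        match = LOCUS_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def find_duplicate_loci(files):
    """Map each LOCUS name seen more than once to the files that contain it.

    `files` is a dict of {filename: GenBank text}.
    """
    seen = defaultdict(list)
    for filename, text in files.items():
        for name in locus_names(text):
            seen[name].append(filename)
    return {name: where for name, where in seen.items() if len(where) > 1}
```

To use it, read each reference file into a dict keyed by file name (for example `{"WS25.gbk": open("WS25.gbk").read(), ...}`) and pass that dict to `find_duplicate_loci`. An empty result means no two LOCUS definitions share a name.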

jeffreybarrick commented 2 years ago

Thanks for pointing this out. I agree, that's not a very good warning!

I'll plan to have it error when a nucleotide sequence already exists for a given seq_id when it loads a new file that refers to the same seq_id.

FWIW for anyone who is curious, breseq does let you split reference annotations across different files, so you could load a GFF3/GenBank file with features but no sequence and then load the nucleotide sequence from a different file. Theoretically, you could even load multiple feature-only GFF3/GenBank files for the same reference sequence, and it would keep all of their features.

jeffreybarrick commented 2 years ago

Actually... I'm having trouble reproducing this problem. When I try to load two GBKs or GFFs with the same seq_id, breseq gives me errors along these lines:

Duplicate seq id found in file '/Users/jbarrick/src/breseq/tests/data/lambda/lambda_bad_orfs.gbk'! 
Features for 'NC_001416' were already loaded from file '/Users/jbarrick/src/breseq/tests/data/lambda/lambda.gbk'.

or

Duplicate seq id found in file 'lambda_bad_orfs.gff3'! 
DNA sequence for 'NC_001416' was already loaded from file 'lambda.gff3'.

Could you email me your two input files, so I can see why this is?

dannagifford commented 2 years ago

I have emailed you the files that gave me the issue.

jeffreybarrick commented 2 years ago

Thanks! This did let me reproduce the error.

I updated things so here's the output now for the next version of breseq:

----------------------------------> WARNING <-----------------------------------
Length assigned to sequence '3' from LOCUS line (441) does not match length previously assigned from source feature (4954463). The larger of the two lengths will be used. If you encounter further errors, make sure LOCUS lengths match the true lengths of your DNA sequences.
--------------------------------------------------------------------------------
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Duplicate information for sequence found in file '35256E_WS89.gbk'!
Features for sequence '3' were already loaded from file '35248E_WS25.gbk'.
Check that you do not have duplicate sequence names/IDs in your reference files.
FILE: ./libbreseq/reference_sequence.h   LINE: 672
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

BTW: I was wrong before in my comment about breseq allowing multiple files containing features for one reference. It will let you split the DNA sequence and features into two separate files, but you can only have one DNA sequence file and one feature file for each reference.

padpadpadpad commented 2 years ago

Hi both! I came a cropper to this error just now trying to use multiple .gbk files produced by Prokka.

Any ideas what would be the easiest/quickest way to rename features/locus definitions in .gbk files to get this to work?

EDIT. I simply opened each reference in Visual Studio Code and did a "find and replace" with the word "NODE". Not overly pretty but was achievable for the 5 genomes I wanted to use for breseq. Seems to now be working.

Cheers Dan

jeffreybarrick commented 2 years ago

Something like that was going to be my suggestion. Glad you found a workaround!

You might be able to head off the problem earlier by taking your assembly FASTAs and giving the contig names in each of those files a different prefix using find/replace, or a short BioPython script if you want to automate. If you do this before Prokka and maybe pass it some of the options that make it give specific locus tag prefixes (though I don't think this is strictly necessary), it seems like you should be able to disambiguate things for input into breseq.
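As a rough illustration of the FASTA-prefixing idea, here is a minimal sketch in plain Python (standard library only, so not the BioPython route mentioned above). The function names `prefix_fasta_headers` and `prefix_fasta_file` are hypothetical, and using the file stem as the prefix is an assumption made for the example. It only rewrites header lines and leaves sequence lines untouched.

```python
from pathlib import Path


def prefix_fasta_headers(fasta_text, prefix):
    """Prepend `prefix` + '_' to the sequence ID on every FASTA header line."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Keep everything after '>' (ID and any description) as it was.
            line = f">{prefix}_{line[1:]}"
        out.append(line)
    return "\n".join(out) + "\n"


def prefix_fasta_file(path, out_dir):
    """Write a copy of `path` into `out_dir`, prefixing headers with the file stem.

    For example, contig '1' in WS25.fasta becomes 'WS25_1'. The original
    file is not modified.
    """
    path = Path(path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / path.name
    out_path.write_text(prefix_fasta_headers(path.read_text(), path.stem))
    return out_path
```

Writing to a separate output directory rather than editing in place keeps the original assemblies intact, so the renamed copies can be checked before they are passed on to annotation.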

dannagifford commented 2 years ago

I used Bash to prepend the file name without its extension (say strainX, for strainX.gbk) to each LOCUS name, so that LOCUS 1 becomes LOCUS strainX_1.

Something like this (which I just wrote out on a smartphone so there could be errors)

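for i in *.gbk; do
  # Strip the .gbk extension to get the prefix (strainX.gbk -> strainX)
  j="${i%.gbk}"
  # Double quotes let the shell expand ${j}; match any whitespace after LOCUS
  sed -r "s/^(LOCUS[[:space:]]+)([0-9]+)/\1${j}_\2/" "$i"
done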

Check that it's doing the right substitution, then change sed -r to sed -r -i for in-place substitution.

Danna


barricklab / breseq

Non-intuitive WARNING + FATAL ERROR when using multiple references with same-name LOCUS definitions #315