barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
138 stars 21 forks source link

What does a mutation labeled gene1 | gene2 means? #374

Open gabypetrungaro opened 3 weeks ago

gabypetrungaro commented 3 weeks ago

Hi, I am running breseq to identify mutations in E. coli samples and I've noticed that a few times mutations appear where in the gene_name column I have something with the format gene1|gene2. I could not find this cases in the documentation.. What does this mean? Below is one example:

duplication_size 12.0 frequency 1.0 gene_name ybjC|nfsA gene_position coding (276-287/288 nt)|coding (5-16/723 nt) gene_product DUF1418 family protein|nitroreductase A, NADPH... gene_strand >|> genes_inactivated nfsA genes_overlapping ybjC locus_tag BW25113_0850|BW25113_0851 locus_tags_inactivated BW25113_0851 locus_tags_overlapping BW25113_0850 mutation_category mobile_element_insertion mutator_status False position 886644 position_end 886655 position_start 886644 ref_seq CGCCAACCATTG repeat_name IS4 repeat_size 1426.0 seq_id CP009273 snp_type | strand -1.0 time -1 type MOB

Thanks a lot, Gabriela.

jeffreybarrick commented 3 weeks ago

This means that the mutation overlaps two different genes (because their coordinates overlap). For any gene-specific information the information about the first one is before the | and the information about the second one is after a |

gabypetrungaro commented 3 weeks ago

Thanks, I don't completely understand what you mean with that their coordinates overlap. Is that something that I could see directly in my reference sequence? How could I see it? Thanks again

jeffreybarrick commented 3 weeks ago

Yes, you can see this in your reference sequence.

If you look at the start and end coordinates of the genes ybjC and nfsA, you'll notice that the same base pairs are contained in the beginning of nfsA and the end of ybjC. The mutation is contained in this overlap, so it affects both reading frames.

gabypetrungaro commented 2 weeks ago

Thanks a lot. This is clear now but it brought me back to a question about promoters that I asked some time ago.

I understood back then that breseq considers 150bp downstream the gene as the promoter region. Are cases like the one here also handled in that way? I mean cases where immediately before the gene (or actually even overlapping) there is another gene. In this case for example the promoter is the same for both genes and it is located downstream the first gene. However, as far as I understand breseq does not take this into account. Am I right? In other words: would a mutation within 150bp downstream of nfsA would be listed as a mutation in nfsA promoter or/and a mutation in the ybjC gene?

jeffreybarrick commented 2 weeks ago

To assign a mutation as being in a "promoter region", it must not overlap any genes.

In the current version of breseq, genes are annotated as being in the promoter region of a gene if they are within a 150-bp window upstream of that gene's start codon (when running gdtools ANNOTATE). If multiple genes pass that test, all should be listed under genes_promoter and locus_tags_promoter.

In earlier versions of breseq, only the one closest gene was annotated as being in the promoter region if the mutation was within this 150 bp window for multiple genes.

gabypetrungaro commented 2 weeks ago

Okay, I get it, thanks. Just to check that I understand correctly: this would mean that there is no way that breseq reports a mutation in the promoter of nfsA, as its promoter is located upstream of the overlapping gene ybjC (they are co-transcribed) and the ybjC gene is more than 150bp long. Is this right?

Based on this understanding, I was curious if ybjC is listed under gene_promoter in any other mutations. However I noticed that for all the mutations that breseq finds in the intergenic region before the start of this gene (grxA/ybjC), the promoter listed is the other gene (grxA) . I don't really understand this. The distance from the mutation to the ybjC start seems to be smaller than 150bp. [see e.g. below -- my ref. genome is E. coli with accession number CP009273, where ybjC starts at position 886369 . https://www.ncbi.nlm.nih.gov/projects/sviewer/?id=CP009273 ]

Is this maybe because that is not upstream for ybjC? What should I look at to see that the mutation is upstream the gene? Is this information in the strand? whether this is -1 or +1? If that's the case then I don't think that explains it..

Thank you again for answering my questions.

gene_name genes_promoter position strand grxA/ybjC grxA 886354 -1.0 grxA/ybjC grxA 886335 -1.0 grxA/ybjC grxA 886335 -1.0 grxA/ybjC grxA 886335 -1.0

duplication_size 9.0 frequency 1.0 gene_name grxA/ybjC gene_position intergenic (-145/-7) gene_product glutaredoxin 1, redox coenzyme for ribonucleot... gene_strand </> genes_promoter grxA locus_tag BW25113_0849/BW25113_0850 locus_tags_promoter BW25113_0849 mutation_category mobile_element_insertion mutator_status False position 886354 position_end 886362 position_start 886354 ref_seq TGCAGAGGG repeat_name IS1 repeat_size 768.0 seq_id CP009273 strand -1.0 time -1 type MOB

jeffreybarrick commented 2 weeks ago

breseq doesn't "know" anything about co-transcription of genes, operon structure, or transcription start sites. That information isn't present in the reference genome. So, it can only use the start, end, and strand of each individual gene relative to the mutation to give a guess about whether a mutation is in a putative promoter region.

If I'm understanding your case correctly, the reference genome has these gene annotations:

complement(885952..886209)
/gene="grxA"

886369..886656
/gene="ybjC"

Your mutation is a duplication affecting bases 886354-886362

So, it is within 150 bp upstream (=before) each gene.

You are right, it should be assigning it to both!

I see now that there is a logic error in the new code where it only assigns it to the first gene (the one at lower coordinates) rather than to both. I'll have to fix this in an upcoming version!!

Another choice the code is making right now is that it will only assign a gene to genes_promoter if there are no intervening genes that are closer to the mutation on that side (so if there was a very short gene, like 90 bp, followed by a another gene in the same operon on one side, it would only assign it as in the promoter region of the first 90 bp gene).