barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
137 stars 21 forks source link

question about criteria for assigning genes_promoter to intergenic region #363

Closed gabypetrungaro closed 5 months ago

gabypetrungaro commented 5 months ago

Hi, I'm using breseq to call mutations in populations resulting from a lab evolution experiment. I encounter some mutations in intergenic regions, listed by breseq as geneX/geneY . In the column genes_promoter, sometimes geneX is listed, sometimes geneY and sometimes none. I would like to know the criteria for assigning the mutation position to a specific gene promoter, as I do not find that specific information in the annotated reference sequence file.

Specifically, I would be very grateful if you could answer the following questions: 1) By exploring the mutations a bit, I can guess that when the genes before and after the intergenic region geneX/geneY correspond to the strand -1, promoter is assigned to geneX (the previous gene), and if strand is 1, then the promoter is assigned to geneY (the next gene). Could you please confirm that this is the case? Does the distance to each of the genes play any role here? 2) What about the cases where the genes correspond to different strands? How is a genes_promoter assigned in that case? 3) Is there a minimum distance to the gene for assigning the mutation to the promoter? If so, what is the minimum distance that Breseq considers? 4) When is an intergenic mutation not assigned to any of the two genes promoter? Thank you in advance, Gabriela.

jeffreybarrick commented 5 months ago

If you get the full help for this command, then I think it answers all of your questions:

gdtools ANNOTATE -h

Relevant snippet:

MUTATION EFFECTS CLASSIFICATION

Each mutation has a 'genes_overlapping' list assigned based on the genes it overlaps. If the mutation affects a position
within --inactivating-overlap-fraction of the length of an overlapping gene from its start, the gene is moved to the 
'genes_inactivated' list if it is also a MOB, SNP causing a nonsense mutation, or an INS, DEL, or SUB that results in a size 
change that is <= the --inactivating-size and results in a frameshift in a protein-coding gene or an INS, DEL, SUB with a 
size change > the --innactivating-size for any type of gene even if it is in-frame in a protein-coding gene. If there are no 
'genes_overlapping' or 'genes_inactivated', a mutation has a 'genes_promoter' list assigned to all genes that are <= the 
--promoter_cutoff bp upstream of a gene. There can be multiple qualifying genes assigned to each lists, but each gene 
will only be in one of the lists.

You can tune some of the parameters with these command-line options. The defaults are given here:

  --inactivating-overlap-fraction <arg>                                                                    
                                   Mutations within this fraction of the length of a gene from its
                                   beginning are assigned to the 'genes_inactivating' versus 'the
                                   genes_overlapping' list if they fulfill other criteria. (DEFAULT=0.8)
  --inactivating-size <arg>        INS, DEL, and SUB mutations in genes that are longer than this length
                                   cutoff are always assigned to the 'genes_inactivating' list, even if
                                   they are in-frame or noncoding genes. (DEFAULT=15)
  --promoter-distance <arg>        Mutations upstream and within this distance of the beginning of a gene
                                   have it added to their 'genes_promoter' list. (DEFAULT=150)

Let me know if you have any remaining questions.

gabypetrungaro commented 5 months ago

Thank you for your answer! I'm afraid I cannot get the full help by typing gdtools ANNOTATE -h . I copy the output at the bottom of this message, but it does not include any of the information that you copied in your message.. How could I get that full information?

I wanted to reproduce this myself because the default values listed in your message do not seem to coincide with the ones in my runs of breseq... (can that be?!) --> I have intergenic mutations with a promoter listed that when I check by hand I find a distance to the gene > 150 .. I copy one example below:

frequency 1 gene_name typA/yihL gene_position intergenic (+42/-175) gene_product GTP-binding protein/putative DNA-binding trans... gene_strand >/> genes_promoter yihL locus_tag BW25113_3871/BW25113_3872 locus_tags_promoter BW25113_3872 mutation_category small_indel position 4053632 position_end 4053632 position_start 4053632 ref_seq T seq_id CP009273 size 1 time -1 title 8965_839_F8_uvrD treatment NIT type DEL classification_category promoter

From the reference sequence file: 'typA': SeqFeature(FeatureLocation(ExactPosition(4051766), ExactPosition(4053590), strand=1), type='gene')), 'yihL': SeqFeature(FeatureLocation(ExactPosition(4053806), ExactPosition(4054517), strand=1), type='gene'))

If I subtract the mutation position from the start of the gene, the result is 174 = 4053806−4053632 , which is greater than the listed default value of 150. Is there something wrong in what I am doing?

Thank you again for your answers. Gabriela.

$ gdtools ANNOTATE -h

================================================================================ breseq 0.36.1 http://barricklab.org/breseq

Active Developers: Barrick JE, Deatherage DE Contact: jeffrey.e.barrick@gmail.com

breseq is free software; you can redistribute it and/or modify it under the terms the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.

Copyright (c) 2008-2010 Michigan State University Copyright (c) 2011-2017 The University of Texas at Austin

If you use breseq in your research, please cite:

Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseq. Methods Mol. Biol. 1151: 165–188.

If you use structural variation (junction) predictions, please cite:

Barrick, J.E., Colburn, G., Deatherage D.E., Traverse, C.C., Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G. (2014) Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq. BMC Genomics 15:1039.

gdtools ANNOTATE/COMPARE [-o annotated.html] -r reference.gbk input.1.gd [input.2.gd ... ] -h,--help Display detailed help message -o,--output Path to output file with added mutation data. (DEFAULT: output.*) -r,--reference File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED) -f,--format Type of output file to generate. See options below (DEFAULT=HTML) -a,--add-html-fields Add formatted fields that are used for generating HTML output. Only applicable to GD and JSON output formats -b,--add-text-fields Add formatted fields in UTF-8 encoded text that are similar to those used in HTML output. Only applicable to GD and JSON output formats --ignore-pseudogenes Treat pseudogenes as valid ORFs for calling amino acid substitutions --repeat-header In HTML mode, repeat the header line every this many rows (0=OFF) (DEFAULT=0) -p,--phylogeny-aware Check the optional 'phylogeny_id' field when deciding if entries are equivalent -g,--region Only show mutations that overlap this reference sequence region (e.g., REL606:64722-65312) -e,--preserve-evidence By default evidence items with two-letter codes are removed (RA, JC, MC, ...). Supply this option to retain them. Only affects output in GD and JSON formats. This option can only be used with a single input GD file (i.e., not in COMPARE mode). -c,--collapse Do not show samples (columns) unless they have at least one mutation

ANNOTATE mutations in one or more GenomeDiff files. If multiple input files are provided, then also COMPARE the frequencies for identical mutations across samples.

Valid output formats: HTML Descriptive table viewable in a web browser GD GenomeDiff with added annotation of mutations TABLE Comma-separated values UTF-8 text file suitable for input into Excel or R Simplified output with the same content/format as HTML output. TSV/CSV Tab- or comma-separated values file suitable for input into R Detailed output with all GD fields. One line per mutation per sample. PHYLIP Alignment of genotypes in PHYLIP format for phylogenetic analysis FASTA Multi-FASTA alignment of genotypes for phylogenetic analysis JSON JavaScript object notation file suitable for parsing

When multiple GD files are provided, the #=TITLE metadata line in each file is used to name each sample/column. If that information is not present, then the GD file name is used (removing the *.gd suffix).

In output, frequencies of 'D' mean that this mutation occurs within a region that is deleted by a different mutation in the genome in question. Frequencies of '?' indicate that there were not enough aligned reads to call a mutation at this position in the genome in question (either for or against the mutation).

PHYLIP/FASTA FORMAT

In this output format, each column in the genotype 'sequence' that is created corresponds to a unique mutational event. For SNPs the base present at that position in each genome is shown. For other types of mutations, 'A' is used for the ancestral allele (e.g., no transposon insertion), and 'T' is used for the derived allele (e.g., new transposon copy inserted). 'N' is used when it is ambiguous as to whether the mutation occurred in the lineage leading to this genome, either because the position is deleted or there are not sufficient reads aligned to a position to call a mutation (i.e., is inside an UN region).

PHYLIP/FASTA output is designed to be input into other programs to create a phylogenetic tree by parsimony methods. In these analyses, be sure that you use basic parsimony that weights all genotype sequence changes equally (e.g., same transition/transversion rates).

INPUT INTO EXCEL

You can load files output in the TABLE or HTML formats into Excel. For loading TABLE output, open an existing workbook and 'Import' as a Text file. You MUST choose Unicode (UTF-8) as the file origin and a comma as the delimiter. For loading HTML output, choose 'Import' as an HTML file.

jeffreybarrick commented 5 months ago

The command was updated and made configurable/explainable in more recent versions ofbreseq than v0.36.1.

I'd recommend that you update to a newer version. At least v0.38.1 or greater should look the same.

I can't say for sure what the logic was in older versions.

gabypetrungaro commented 5 months ago

Hi, Ohh, okay, thanks! I have no idea why I am using an older version.. I will try out the newer one! Best wishes, Gabriela.