Need some advice for annotating *.gd file with unassigned junctions

barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.

http://barricklab.org/breseq

GNU General Public License v2.0

144 stars 21 forks source link

Need some advice for annotating *.gd file with unassigned junctions #255

Closed AMVeenstraSkirl closed 3 years ago

AMVeenstraSkirl commented 3 years ago

Hi Jeffrey,

Thanks a lot for your work and support on Breseq! I have been trying to find my question in your papers already, spefically the one together with Daniel from 2014 (Deatherage&Barrick, 2014; abbr. D&B,2014), that contains detailed instructions on how to incorporate unassigned junctions into the output.gd file. The instructions and different steps for your example seem clear to me, so I want to apply them to my own project.

In my case, there was no new junction reported in the predicted mutations, however there are 9 entries in the unassigned junction table. See here: 201229_unassignedJunctions

In general the values of the scores, skew and frequency seem legit enough for me to take them as serious calls and find out what has happened there (except for entry 5). The first two are not the issue. It's the other six entries that I am wracking my brains about. When looking closely at the positions, there are three pairs of entries connected to three sites on the genome. I have marked them with dark purple on the left hand side. Repeat sequences are involved in every entry suggesting some mobile element action. However, the two repeats in every pair are different from each other - in every pair. So, I think I cannot link it to a single jump of one mobile element for any of the three sites, as has been described for the example in points 3.2.10 and 3.2.11 in D&B, 2014.

I know a little bit what is going on for the entry marked in cyan. There seems to be a huge duplication of appr. 200,000 bp: CP002094_1000000-1250000

The duplication in the cit+ E. coli in the tutorial of the breseq documentation contains only one single entry. So, I am puzzling over the meaning of the second entry

My first idea was to check, whether the reads are mixed, which is the case for the white rows of each entry, while the reads of the orange rows span over both sides. Only one example for one white row: 201229_unassignedJunctions_reads I am not sure what to do with this information. Does this translate into mixed reads as shown in figure 5e of D&B, 2014? This would indicate a tandem duplication, which I can only find for this one particular site. The coverage plots follow average read depth for the other two sites.

The perfect way to find out, seems to me as proposed in D&B, 2014 by mutating the reference, and have a go with breseq at it again. But I am not sure how to write the lines, as given by the entries. In the paper, you could combine two entries from the unassigned table into one line in the *.gd-file. I am not sure on how to do this, for my examples, with two different repeat ends.

Do you see a way on how to solve this and could you give me a hint/advice on how to approach this further?

Thanks in advance! Annie

jeffreybarrick commented 3 years ago

There's a lot going on here. Interpreting junctions can be tricky. Here are my best guesses:

The three pairs highlighted in purple are insertions of mobile elements.

breseq can't tell the difference between ends of the same element located at different sites in the genome (they have the exact same sequence as far as the reads spanning the new junction can tell), so it ends up picking one of them somewhat arbitrarily. If you looked at those regions, you would see that they are identical, or at least identical on their margins on the scale of your read length. The orange highlighting of those rows tells you that side of the junction is matched by multiple sites in the genome.

You can help breseq give you more precise output in these cases if you can can annotate all locations of the IS element as repeat_region or mobile_element entries in the annotated reference file and give them the exact same name. Then breseq should be able to pair them up for you and calculate the target site duplication size. It can take some trial and error to figure out where the ends should be. You should generally get a clean +N bp type annotation when this is the case and not deletion or extra insertion bases around the MOBs that will be predicted.

The duplication related to the MOB you highlighted in cyan and the coverage plot likely involves this chain of events:

Insertion of new MOB (as shown by the two linked JC entries).
Recombination between this new element and an existing IS element (red lines on the ends of the duplicated entry in your graphic) resulting in the duplication. That is, there is a new IS element insertion and then that mediates a duplication between that location and an existing IS element located at the other end.

The 1st and 2nd entries look like they are integration of your plasmid into the chromosome near the origin of your circular genome. There also seems to be some kind of duplication in the chromosome associated with the integration.

The way to read the 1st entry is that a read spanning the junction starts at low coords and goes up to position 200 in the plasmid, then it jumps to 367287 in the genome and starts going up in coordinates. The 2nd entry is for starting at 245 in the plasmid and going up connected to 368508 and then going down in the chromosome.

So bases 201-244 in the plasmid were deleted. The rest of the plasmid was inserted between a target site duplication of bases 367287-368508 in the genome in a forward orientation (base 245 going up then wrapping around and going up to 201). The duplication may have occurred during whatever recombination event put the plasmid into the genome. I would predict that there is excess coverage of bases 367287-368508 if you plotted coverage in that region.

AMVeenstraSkirl commented 3 years ago

Hi, thanks for your answer.

So, I guess I will go back to the annotated reference and try to add the repeat_region or mobile_element tag to the specific regions. I already figured, that a lack of these might cause the lack of MOB-entries in the mutation list. Do I have to run all of breseq again, after the change of reference? Or is it possible to use a shortcut? (I have 36 genome sequences int total, might save some calculation time if possible) As for the purple entries, I was a little confused, because the orange rows of each pair do not correspond to one single mobile element, but belong to two different mobile elements each. This bacterial species contains 71 copies of 11 different mobile elements in its genome, as well as 2 complete and 4 partial phage sequences. Maybe after the addition of the mobile_element tag to the reference, the purple entries will be called as MOB entry. I will check this, after the adjustments of the reference.
For the plasmid entries, there is something else going on (, as well?). Because I constructed it, using a native gene from the chromosome. The missing bases 201-244 of the plasmid are bases that were cut out from the multiple cloning site, and the bases 367287 to 368508 belong to the part of the chromosome that I cloned in between there. You are right, there is excess coverage for this area. However, I interpreted this as a lot of copies in the cells from the high copy number plasmid. The coverage of this particular gene is roughly 40-times higher than the average coverage of the genome (>20,000 vs. 500).

So in summary, regarding the construction work I thought/hoped to understand these two entries as the presence of the high copy plasmid with the particular insert. I would expect any recombination event to go along with some randomness, causing some mutations or small indels in the sequence involved. However, I don't see any calls in the predicted mutation table in that particular genomic region. I suppose we cannot really tell whether the plasmid was integrated into the genome as well, or whether the huge number of plasmids have recombinated with each other? Regarding this additional information on the plasmid, would you still think that the plasmid was integrated?

Thanks, a lot!

jeffreybarrick commented 3 years ago

Unless you used the -k option to keep intermediate files, you'll need to re-run breseq from the beginning. I recommend using something like -l 80 when you are debugging genomes annotated with the IS elements to make your runs faster. I think once you get these annotated that you will find they are the same element or closely related elements that share a lot of exact sequence homology. Feel free to re-open this issue and post here if you encounter difficulties during the IS annotation process – we wish that there were a tool out there that would do this in an automated way.
Your interpretation of this as your constructed plasmid containing the gene you cloned from the genome makes sense. The high coverage here should match (+1) that of the whole plasmid that you can view on the summary page.

AMVeenstraSkirl commented 3 years ago

Thanks a lot for your quick answer!

Well, I did not run them with -k indeed. But I'll manage anyways - high performance computer cluster for the win. Pointing me towards homology between the repeat ends helped for one of the entries, that turned out to have 58 bp identity. So I am expecting those to be a similar IS, as you predicted. The other two did not have any similarity. But I will wait until I have finished the annotation of the mobile elements. In case it's a struggle, I will let you know. Otherwise, I am so much further right now already!
The coverage of the plasmid roughly matches the coverage of the chromosomal region of ~20k: Except for the part between 500 and 1000 bp that drops to below 15k, but I am willing to ignore this as within the limits of sequencing errors for my own sanity's sake....

Thanks a lot once again! I hope you will never hear from me again because that would mean my issues would be solved and breseq fulfilled all the researcher's dreams. Your doing an awesome job in maintaining breseq and suporting us all with your answers and suggestions. Keep it up! Bye

AMVeenstraSkirl commented 3 years ago

Hi Jeffrey,

I have been busy annotating my genome with the "mobile_element" entries, and some other things that were missing. I have tried to run it just now and got this error: 210115_error_invalidStartEnd_coordinates

Something with the start and end positions is not right. Most likely due to some of the positions I entered. The error just specifies the line of code in the reference_sequence.h header file.

I searched for a similar issue on here and found one of your files that defines the error message and some of the variables (start_1 and end_1) that are involved in this. However, I could not decifer the initialization of the "end_1" variable in the piece of code you have on here to get an indea on where to look at in my file.

Is this error due to the positions in one entry (gene[...]123456..234567) not being consecutive, or because of the order of the entries, or other conflicts between entries? Or all together?

Thanks!

PS. I cannot re-open the issue...

dannagifford commented 3 years ago

Replying as I've also had a reference genome with un-annotated IS regions and experienced the same issue with junctions.

"Is this error due to the positions in one entry (gene[...]123456..234567) not being consecutive"

Sort of, not 'consecutive' exactly, but for a gene on the forward strand (i.e.strand 1), start must be smaller than end. For a gene on the reverse strand (i.e. strand -1), end must be smaller than start. You've given arbitrary numbers here, but I suppose you are aware that these are the gene coordinates (in the genome) in genbank format? It seems these coordinates are not being parsed correctly for your gene, as your error message says that start = 1 and end = 0 (which fails the start<end check).

Danna

On Fri, Jan 15, 2021 at 11:59 AM AMVeenstraSkirl notifications@github.com wrote:

Hi Jeffrey,

I have been busy annotating my genome with the "mobile_element" entries, and some other things that were missing. I have tried to run it just now and got this error: [image: 210115_error_invalidStartEnd_coordinates] https://user-images.githubusercontent.com/35833695/104723391-7c92b180-572f-11eb-8bab-69d0580405cb.png

Something with the start and end positions is not right. Most likely due to some of the positions I entered. The error just specifies the line of code in the reference_sequence.h header file.

I searched for a similar issue on here and found one of your files that defines the error message and some of the variables (start_1 and end_1) that are involved in this. However, I could not decifer the initialization of the "end_1" variable in the piece of code you have on here to get an indea on where to look at in my file.

Is this error due to the positions in one entry (gene[...]123456..234567) not being consecutive, or because of the order of the entries, or other conflicts between entries? Or all together?

Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/barricklab/breseq/issues/255#issuecomment-760900992, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACBVJ3JGSFNRNL35DVGWRMTS2AU33ANCNFSM4VNG7NNA .

AMVeenstraSkirl commented 3 years ago

Hi Danna!

Yes, I was indeed aware that the arbitrary numbers are standing for the start and end of a gene and are actually not consecutive. Sorry for the weird wording. I meant what you said:

Sort of, not 'consecutive' exactly, but for a gene on the forward strand (i.e.strand 1), start must be smaller than end. For a gene on the reverse strand (i.e. strand -1), end must be smaller than start.

Ok, so I will have to search the entries in my file again for position combiantions that are "non-complying" to this criterium. Thanks a lot!

AMVeenstraSkirl commented 3 years ago

Hi,

For anyone who encountered a similar problem, my final remarks on this:

The check on all the positions in the new entries, revealed that a typo in one of the entries created a position number greater than the total number of my genome. Check with something like grep " gene " nameOfYourReference.gbk For me I was looking for mobile element entries so i could look for "mobile_element " and non coding RNAs "ncRNA ". Save in a text-file, some re-formatting and calculating the difference between the positions can reveal unintended typos.

HOWEVER, after fixing this mistake the file would still yield the original error. Turns out, when creating the new version of my file, the actual DNA sequence at the bottom of the file was not copied together with the rest of the entries. Where it should say something like this: ORIGIN 1 attgctgact atgctcgata [insert millions of bases...] // I found my file to contain: ORIGIN //.

Fixed the issue with a copy-paste-job. Thanks for your help!

jeffreybarrick commented 3 years ago

Thanks for adding the explanations.

I've tried to add some more feedback about exactly which feature has these coordinate problems to the code, but I'm not seeing this exact error checking code being triggered in my text cases. If you care to share the reference file that was giving the "invalid start-end" error with me by email, I can see about trying to make that error message more informative.