alekseyzimin / masurca

GNU General Public License v3.0

Does SAMBA list changes that were made to the assembly somewhere? #330

Open MarkusRainerSchmidt opened 9 months ago

MarkusRainerSchmidt commented 9 months ago

Hi,

I am wondering whether SAMBA outputs the changes it made to the assembly (i.e. which sequences have been inserted, changed, or removed). I would also be fine with getting that information from the intermediate files. Is the format of these files documented somewhere?

I am guessing that the .patches.uniq.links.txt file holds the changes to the assembly, is that correct? I also do not fully understand the format. Every line contains the following columns: <ctg1>.<pos1> <num1> <str1> <ctg2>.<pos2> <num2> <str2> <len> <seq>. Does each line then mean that <seq> has been placed between <ctg1>.<pos1> and <ctg2>.<pos2>? If the two strands are different, would it reverse-complement one of the contigs? What do <num1> and <num2> stand for?

Thanks,

Markus

bioinfoMMS commented 5 months ago

Hi Markus,

Did you ever figure this out? I have the same questions about the samba output file formats and can't seem to find documentation for it anywhere.

Thanks!

MarkusRainerSchmidt commented 5 months ago

No, from the output files I could not figure it out.

However, I am running SAMBA in the mode where it is only allowed to fill in gaps. So I used this information to create a script that matches the contigs (here: continuous sequences between gaps) of the input assembly against the output assembly. Since the contigs are not changed at all, you do not even need an aligner here; an exact string match (e.g. str.index in Python) is enough. Then, knowing where the contigs from the input are located in the output, you can reconstruct the size and position of the filled-in gaps. Oh, and you have to cut away the first and last 1000 bp (the -o parameter of SAMBA) of the input contigs before the matching, since SAMBA will modify these ends.
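The matching idea can be sketched roughly as below. This is a minimal illustration and not the actual script: the names `locate_contigs`, `gap_changes` and `Placement` are made up for this example, and it assumes that gaps are runs of N, that each input sequence and its output counterpart are already loaded as plain strings with the same letter case, and that contigs keep their order and orientation (no reverse complement) in the output.

```python
import re
from typing import List, NamedTuple, Tuple


class Placement(NamedTuple):
    """Where one trimmed input contig sits in the input and in the output."""
    in_start: int
    in_end: int
    out_start: int
    out_end: int


def locate_contigs(input_seq: str, output_seq: str, trim: int = 1000) -> List[Placement]:
    """Find each input contig (trimmed at both ends) in the output by exact match.

    Assumption: `trim` should equal the value passed to SAMBA's -o parameter,
    because the contig ends within that distance may be altered.
    """
    placements = []
    search_from = 0
    # A "contig" here is any maximal stretch without N (i.e. between gaps).
    for match in re.finditer(r"[^Nn]+", input_seq):
        core_start = match.start() + trim
        core_end = match.end() - trim
        if core_end <= core_start:
            continue  # contig too short to survive trimming; skipped
        core = input_seq[core_start:core_end]
        # Exact string search; continue after the previous hit to keep order.
        pos = output_seq.find(core, search_from)
        if pos < 0:
            continue  # not found unchanged in the output; skipped
        placements.append(Placement(core_start, core_end, pos, pos + len(core)))
        search_from = pos + len(core)
    return placements


def gap_changes(placements: List[Placement], output_seq: str) -> List[Tuple[int, int, bool]]:
    """For each pair of neighbouring located contigs, report
    (length of the region in the input, length in the output, output still has N).

    Both lengths include the trimmed contig ends on either side of the gap.
    If a contig was skipped above, the region spans more than one gap.
    """
    changes = []
    for left, right in zip(placements, placements[1:]):
        region = output_seq[left.out_end:right.out_start]
        still_gapped = "N" in region.upper()
        changes.append((right.in_start - left.in_end, len(region), still_gapped))
    return changes
```

For a real assembly you would call this once per sequence pair after reading the FASTA files; short contigs and repeated sequences can lead to skipped or ambiguous matches, so the result is worth checking by hand.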

Hope that helps,

Markus