bcgsc / arcs

🌈Scaffold genome sequence assemblies using linked or long read sequencing data
GNU General Public License v3.0

Understanding specific scaffolding output #163

Closed RNieuwenhuis closed 11 months ago

RNieuwenhuis commented 1 year ago

Hi @lcoombe,

I was just comparing some older work I did some years ago to a newly published result. Using PacBio, Bionano and 10X we managed to get our assembly mostly to chromosome level and were very happy with that. There were still a handful of scaffolds that were not yet complete chromosomes.

I aligned my result to a newly published result and got the feeling that these scaffolds that were not yet complete chromosomes should have been scaffolded back in the day, when I used ARCS for the linked-read scaffolding.

Now I have gone back to my results and am trying to find out why that did not happen, and which parameters may have caused the link to be filtered out. Just for the sake of learning.

So, based on the dot-plot I know which scaffolds should have been possible to connect and in what orientation that would be possible. I know my scaffold X and Y got number 16 and 60 in the renaming step.

Now in my_assembly_prefix_c5_m50-10000_s98_r0.05_e30000_z500_main.tsv I find the following:

U    V    Best_orientation  Shared_barcodes  U_barcodes  V_barcodes  All_barcodes
16+  60+  T                 1                442         1112        4403707
60-  16-  T                 1                1112        442         4403707

Could you please explain why the combinations 16+ 60- and 60+ 16- are not listed, as those are now confirmed to be the correct relative orientations? For other pairs of scaffolds I see more orientations listed. Is there a filtering step that I could have tweaked back then that causes this?

lcoombe commented 1 year ago

Hi @RNieuwenhuis,

There are a lot of factors that could contribute to a join not being made, so it's honestly pretty hard for me to say for sure. Some possibilities:

It's really hard to say if any filtering would have led to the expected output. Perhaps if you had made the c parameter less stringent (lower), there would have been more barcodes that supported this, but it's hard to say. If you wanted to know for sure, you'd need to do a failure mode analysis, where you look into the read alignments and work out from there why barcode support wasn't found.

RNieuwenhuis commented 1 year ago

Hi @lcoombe

Thanks for your reply. I reran with -m 20-10000 -c 3 -e 100000, and now it connects the scaffolds that I know should be connected, and a lot more shared barcodes are found.

U    V    Best_orientation  Shared_barcodes  U_barcodes  V_barcodes  All_barcodes
16-  60+  F                 33               9395        9563        807069
60-  16+  F                 33               9563        9395        807069
16-  60-  F                 11               9395        3482        807069
60+  16+  F                 11               3482        9395        807069
16+  60+  F                 15               6865        9563        807069
60-  16-  F                 15               9563        6865        807069
16+  60-  T                 75               6865        3482        807069
60+  16-  T                 75               3482        6865        807069

I understand that these are considered loose settings and that default settings are usually the most sensible ones.

What I still don't understand is why there are now a lot more orientations listed compared to the previous run. Could you maybe elaborate a bit on the reasons for that, please? Is it arcs or LINKS that causes that? Is it related to the -a setting for links? Or -l, maybe?

lcoombe commented 1 year ago

why there are now a lot more orientations listed compared to the previous run

Do you mean more links or more different orientations? You can see that only one relative orientation is seen as the 'best' - which is 16+ -> 60- (60+ -> 16- is the equivalent due to it being the reverse complement).
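To make the reverse-complement equivalence concrete, here is a minimal Python sketch. It is only an illustration, not ARCS code: the helper names `flip`, `reverse_complement` and `canonical` are hypothetical, and the rows are copied from the table in the previous comment. Reading a join from the opposite strand swaps the order of the two scaffolds and flips both signs, so the eight rows collapse into four distinct links.

```python
def flip(oriented):
    """Flip the strand sign of an oriented scaffold ID such as '16+'."""
    name, sign = oriented[:-1], oriented[-1]
    return name + ("-" if sign == "+" else "+")


def reverse_complement(link):
    """Same join read from the other strand: swap the order, flip both signs."""
    u, v = link
    return (flip(v), flip(u))


def canonical(link):
    """Pick one representative for a link and its reverse complement."""
    return min(link, reverse_complement(link))


# (U, V, Shared_barcodes) rows copied from the table in the thread.
rows = [
    ("16-", "60+", 33), ("60-", "16+", 33),
    ("16-", "60-", 11), ("60+", "16+", 11),
    ("16+", "60+", 15), ("60-", "16-", 15),
    ("16+", "60-", 75), ("60+", "16-", 75),
]

# Collapse each row and its reverse-complement twin into a single entry.
unique = {canonical((u, v)): shared for u, v, shared in rows}

# The link with the most shared barcodes among the four distinct ones.
best = max(unique, key=unique.get)
```

With these rows, `best` is the 16+ -> 60- link, which is the pair of rows flagged T in the table.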

Figure 1 of our ARCS manuscript (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6030987/) has a schematic of the algorithm which might help you to better understand what's going on.

To decide on the orientation supported by a given barcode, the code tallies the number of read pairs for that barcode which map to the head (5' end) or tail (3' end) of a contig. It runs a significance test to determine whether the reads fall significantly more often at the head or the tail end, and assigns the orientation of that contig accordingly. Once all of these are tallied, another test decides which of the various orientations found between a given pair of contigs is best supported by the barcode data. Note that although you see all the combinations in the verbose file that you are looking at, only the best-supported orientation will be found in the scaffold graph, which is traversed to output the final assembly scaffolds. The scaffold graph is created at the arcs stage, so prior to LINKS. The links found depend on all of the parameters that are input to this arcs stage.
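The head/tail decision for a single barcode on a single contig can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the ARCS implementation: it assumes a one-sided binomial test against an even head/tail split, the function names `binomial_tail_p` and `assign_end` are hypothetical, and the 0.05 cutoff is only a guess based on the `r0.05` in the output filename earlier in the thread.

```python
from math import comb


def binomial_tail_p(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def assign_end(head_reads, tail_reads, max_p=0.05):
    """Return 'head', 'tail', or None if neither end is significantly favored.

    Assumption: reads are equally likely to land on either end under the
    null hypothesis, and max_p is the significance cutoff.
    """
    n = head_reads + tail_reads
    if n == 0 or head_reads == tail_reads:
        return None
    k = max(head_reads, tail_reads)
    if binomial_tail_p(k, n) > max_p:
        return None  # the imbalance could easily be chance
    return "head" if head_reads > tail_reads else "tail"


# Hypothetical tallies for one barcode on one contig.
strong = assign_end(9, 1)   # 9 of 10 reads at the head
weak = assign_end(3, 2)     # 3 of 5 reads at the head
```

Under these assumptions, 9 of 10 reads at the head gives a tail probability of 11/1024 (about 0.011), so the barcode is assigned to the head, while 3 of 5 gives 0.5 and the barcode is left unassigned. This also hints at why looser settings can surface more barcodes and more candidate orientations per contig pair.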

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your interest in ARCS!