Suggestions on how to set up weights

desmodus1984 commented 2 months ago

Hi, I used short and long reads to assemble a bat genome. I used pomoxis to assemble it using a reference, and I got 98 contigs. The BUSCO score was 97%. I then mapped the reads back to the assembly: many contigs had high coverage, but some had very low coverage (50-60%), which signaled a problem, and there were several regions with no depth. Then I tried detecting misassemblies with CRAQ, and it found many, as the mapped reads had suggested.

A group published 6 reference bat genomes, and I wanted to consult you about setting up the analysis. There are 6 main reference-grade assemblies, so I wanted to know whether I should give them all the same weight or different weights, since they are from different families and hence more distantly related to my species.

Also, I tried running RagTag because another group sequenced the same species but didn't publish the assembly. My installation on a server didn't work, but I later found that https://usegalaxy.eu/ has RagTag, so I used it to correct my assembly. I have two questions:

Have you compared the performance of ntJoin and RagTag? My RagTag attempt scaffolded my genome into 3722 fragments.
I was planning on using the other group's assembly as another "reference-grade" assembly, since the group used Hi-C to scaffold the genome (4948 fragments). Thus, my question is: should I give this genome more weight than the 6 published reference-grade assemblies, or should all the reference assemblies get the same weight?

Lastly, I am planning on running it on an HPC cluster, and I would appreciate it if you could tell me how much memory ntJoin needs and how many cores I should assign to the job. Any hints and suggestions to improve my genome assembly are highly appreciated.

Thanks,

lcoombe commented 2 months ago

Hi @desmodus1984,

In our paper (https://doi.org/10.1093/bioinformatics/btaa253), we compared ntJoin to RaGOO, which was the predecessor of RagTag. I haven't done any tests with the newest RagTag version.
Just want to clarify your project set-up so that I can give you the best advice - you have 6 reference-grade bat assemblies of related species, and a scaffold-level assembly of the same species from another group? Is that right?
In terms of memory, for the human genome assemblies we show in our paper, the jobs required less than 11 GB of RAM. You could always request a bit more on your cluster just in case, as the memory may vary between different jobs, but that gives you a ballpark. In terms of cores, the main threaded portion of ntJoin is generating the minimizers, and that step uses a max of 5 threads, so you could set the threads at 5.

Thank you for your interest in ntJoin! Lauren

desmodus1984 commented 2 months ago

Hi Lauren,

Thanks for the quick reply. It would be very nice to see a comparison between ntJoin and the new RagTag. You are right, that's my set-up: 6 reference-grade genomes and one scaffolded draft.

Thanks.

lcoombe commented 2 months ago

Hi @desmodus1984,

Ok, thanks for clarifying!

There are a couple of considerations that you could make here:

I'd suggest using the scaffolded draft from the same species in a first, independent ntJoin run, then using the other reference-grade assemblies for a second ntJoin round after that (building on the post-ntJoin scaffolds from the first round). Then, you could give the reference genomes equal weights, assuming that you want them to be weighted equally for scaffolding.
Consider how you want to use the cut option - if it is set to True, it means that it will use the structure of the reference assembly (or assemblies) and make cuts in your input assembly to fit the structure to the reference(s). If False, it will still scaffold, but will not break any of the scaffolds. You could do that if you think there are species-specific structures in your assembly that you want to retain.
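
A minimal sketch of the two-round set-up described above, written as a small Python helper that only builds the command lines (it does not run anything). The parameter names (target, target_weight, references, reference_weights, k, w, t, no_cut) are as I recall them from the ntJoin README, so please check them against the README or the ntJoin help output before use. All file names, the weight values, and the k and w values are placeholders for illustration, not recommendations; t=5 follows the thread count mentioned earlier in this thread.

```python
import shlex


def ntjoin_command(target, references, reference_weights, target_weight=1,
                   k=32, w=500, threads=5, no_cut=False):
    """Build an `ntJoin assemble` command as a list of arguments.

    Parameter names are assumptions based on the ntJoin README;
    verify them against your installed version.
    """
    if len(references) != len(reference_weights):
        raise ValueError("need exactly one weight per reference")
    args = [
        "ntJoin", "assemble",
        f"target={target}",
        f"target_weight={target_weight}",
        # ntJoin takes space-separated lists inside a single argument.
        "references=" + " ".join(references),
        "reference_weights=" + " ".join(str(x) for x in reference_weights),
        f"k={k}", f"w={w}", f"t={threads}",
    ]
    if no_cut:
        # Assumed flag for scaffolding without breaking the input contigs.
        args.append("no_cut=True")
    return args


# Round 1: the same-species scaffolded draft as the only reference.
round1 = ntjoin_command("my_assembly.fa", ["same_species_hic.fa"], [2])

# Round 2: the six reference-grade assemblies, all with equal weight,
# applied to the scaffolds produced by round 1.
# "round1.scaffolds.fa" is a hypothetical placeholder; use the actual
# output file name that ntJoin writes for round 1.
round1_output = "round1.scaffolds.fa"
bat_refs = [f"bat_ref{i}.fa" for i in range(1, 7)]  # hypothetical names
round2 = ntjoin_command(round1_output, bat_refs, [2] * len(bat_refs))

# shlex.join quotes the space-separated lists so they can be pasted into a shell.
print(shlex.join(round1))
print(shlex.join(round2))
```

Keeping the two rounds as separate commands makes it easy to inspect the round-1 scaffolds before letting the more distantly related references influence the result.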

Let me know if you have any other questions! Lauren

desmodus1984 commented 2 months ago

Hi, I wanted to ask you something. Do you know a way to eliminate the zero-depth regions? They don't make any sense to me; I think they are false positives. With around 100X depth, there should be no regions with zero depth.

Thanks

lcoombe commented 2 months ago

Hi @desmodus1984,

For the zero-depth regions, unfortunately you couldn't use ntJoin directly because it doesn't use any read information in the execution. You would expect full contigs with zero coverage to have no mappings to any reference, and end up in the 'unassigned' FASTA (although legitimate contigs can also end up in this file).

I'm not familiar with pomoxis for assembly, but it is indeed strange to have zero coverage regions if you are re-aligning the same reads. If it were me, I would start with characterizing these regions - what proportion of the assembly do they comprise? Are they in individual contigs, or smaller regions within the contigs? Are they near any gaps? Are you filtering your alignments at all, or using more conservative parameters? This information could help you understand how substantial these regions are within your assembly. Depending on the results, you could consider removing these contigs, or masking those regions - but I would definitely ensure you fully understand those regions first!
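
As a starting point for that characterization, here is a minimal sketch in Python. It assumes you have per-base depth in the three-column, tab-separated format that `samtools depth -a` writes (contig, 1-based position, depth), with every position reported. The function names and the toy input are made up for illustration; this only summarizes the regions and does not remove or mask anything.

```python
from collections import defaultdict


def zero_depth_regions(depth_lines):
    """Collect runs of consecutive zero-depth positions per contig.

    Returns (regions, lengths): regions maps contig -> [(start, end), ...]
    with 1-based inclusive coordinates, and lengths maps contig -> number
    of positions seen in the input.
    """
    regions = defaultdict(list)
    lengths = defaultdict(int)
    current = None  # (contig, start, end) of the run being extended
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, pos, depth = line.split("\t")[:3]
        pos, depth = int(pos), int(depth)
        lengths[contig] += 1
        if depth == 0:
            if current and current[0] == contig and current[2] == pos - 1:
                current = (contig, current[1], pos)  # extend the run
            else:
                if current:
                    regions[current[0]].append(current[1:])
                current = (contig, pos, pos)  # start a new run
        elif current:
            regions[current[0]].append(current[1:])  # close the run
            current = None
    if current:
        regions[current[0]].append(current[1:])
    return dict(regions), dict(lengths)


def zero_depth_fraction(regions, lengths):
    """Fraction of each contig's positions that have zero depth."""
    return {
        contig: sum(end - start + 1 for start, end in regions.get(contig, [])) / n
        for contig, n in lengths.items()
    }


# Toy input standing in for the lines of a `samtools depth -a` output file.
example = [
    "ctg1\t1\t100", "ctg1\t2\t0", "ctg1\t3\t0", "ctg1\t4\t95",
    "ctg2\t1\t0", "ctg2\t2\t0",
]
regions, lengths = zero_depth_regions(example)
fractions = zero_depth_fraction(regions, lengths)
print(regions)
print(fractions)
```

A contig with a fraction of 1.0 has no reads mapped at all, while a small fraction points to short internal regions, which you could then compare against gap positions and your alignment filters.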

