Open francicco opened 5 years ago
We are also having this issue, thanks!
Hello, and thank you for your interest in CAMSA!
First things first, please make sure you are running the latest version of CAMSA.
Now to the problem.
First, if you are trying to merge two distinct (scaffold) assemblies, please be aware that CAMSA needs to know the set of contigs from which both assemblies were generated. Please correct me if I'm wrong, but it seems that the two (de novo) assemblies you are trying to merge were not generated from the same set of contigs (which CAMSA requires). But I think there is a workaround.
When running fasta2camsa_points.py, the idea is that the first argument is the contigs.fasta file, while the second and subsequent arguments are the assembly_n.fasta files. nucmer is invoked on each pair of files contigs.fasta + assembly_n.fasta, aligning the contigs in contigs.fasta to the (scaffold) assembly in each assembly_n.fasta file and extracting the order and orientation information (i.e., the assembly points) that CAMSA then operates on.
If you do not have the same set of contigs that both of your assemblies are built from, you can try first extracting the pairwise alignments of the two assemblies into a separate FASTA file, contigs.fasta, and then running fasta2camsa_points.py as follows:
fasta2camsa_points.py contigs.fasta assembly1.fasta assembly2.fasta -o output
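For anyone scripting this step, here is a minimal Python sketch of how the argument order described above (contigs first, then one or more assemblies, then -o) could be assembled before launching the tool. The helper names `build_fasta2camsa_command` and `missing_inputs` are hypothetical and not part of CAMSA; only the command-line shape is taken from the command above. `shlex.join` needs Python 3.8 or newer.

```python
import shlex
from pathlib import Path


def build_fasta2camsa_command(contigs, assemblies, output_dir):
    """Return the argv list: contigs file first, then every assembly, then -o <output>."""
    if not assemblies:
        raise ValueError("at least one assembly FASTA file is required")
    return ["fasta2camsa_points.py", str(contigs), *map(str, assemblies), "-o", str(output_dir)]


def missing_inputs(paths):
    """Return the input paths that do not exist or are empty (hypothetical pre-flight check)."""
    return [str(p) for p in paths if not Path(p).is_file() or Path(p).stat().st_size == 0]


# File names below are the placeholders used in this thread.
command = build_fasta2camsa_command(
    "contigs.fasta", ["assembly1.fasta", "assembly2.fasta"], "output"
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once `missing_inputs` reports nothing.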
Now, regarding the error in your run: it seems that nucmer, for some reason, has not finished the alignment. Can you please provide the log of the run, so I can get a better grip on the error?
It looks like I solved the problem: apparently, nucmer runs out of memory, and you have to recompile it. @jokelley, google the actual error from the nucmer error log file; I can't find the compile command now. The problem comes afterward: it takes ages. F
@francicco Great to hear that the problem went away.
Please also note the part of my response regarding CAMSA's expectations about the input assemblies, contigs, etc.
And let me know if I can be of any further help!
@aganezov,
The idea that I have, though I honestly don't know if it is correct, is to merge two assemblies produced from the same set of reads but with different strategies. But what you're suggesting is not very clear to me: "If you do not have the same set of contigs that both of your assemblies are built with..."
F
@francicco Yeah, I believe I understand your situation.
CAMSA was primarily designed to merge multiple scaffold assemblies, where every scaffold assembly is built on the same set of contigs. Imagine having contigs.fasta produced by an assembler such as SPAdes, and then scaffolding those contigs with different sources of information: (i) a jumping library, giving assembly1.fasta; (ii) a long-read-based scaffolder, giving assembly2.fasta; (iii) optical mapping, giving assembly3.fasta. Using CAMSA, one can now compare and merge all of these scaffold assemblies at the same time. To do so, a conversion from FASTA to the CAMSA format is needed first, and that is where the following command would be executed:
fasta2camsa_points.py contigs.fasta assembly1.fasta assembly2.fasta assembly3.fasta -o output
Now, in your case, I believe you do not have the common set of contigs (i.e., building blocks) for both assemblies available, and it is required for CAMSA. One way I think you can get there is to first pairwise align your two assemblies to each other, and then extract every local alignment into a contigs.fasta file. Then both of your assemblies can be viewed as orders of oriented pieces from contigs.fasta, and you'll be able to use CAMSA. In a way, your assemblies would become scaffolds on the common set of contigs.
Please, let me know if this is helpful, and I'd be glad to help you in more detail if needed.
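A minimal Python sketch of the extraction step described above: given one of the assemblies and a list of aligned regions, cut each region out and write it as a record for contigs.fasta. Nothing here is CAMSA code. The function names, the `contig_<n>` naming scheme, and the assumption that regions arrive as (sequence name, start, end) with 1-based inclusive coordinates are all illustrative, and how the regions are obtained from the aligner output is left open.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}; the name is the first word of the header."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def extract_regions(sequences, regions):
    """Cut aligned regions out of the sequences and return them as FASTA text.

    regions: iterable of (sequence_name, start, end), 1-based and inclusive (assumed).
    Coordinates given in reverse order are swapped, and the forward-strand
    sequence is always emitted.
    """
    out = []
    for i, (name, start, end) in enumerate(regions, 1):
        lo, hi = sorted((start, end))
        out.append(f">contig_{i} {name}:{lo}-{hi}\n{sequences[name][lo - 1:hi]}\n")
    return "".join(out)
```

Writing the returned text to contigs.fasta gives a file that can be passed as the first argument of fasta2camsa_points.py.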
Makes sense now; let's see if I understand it. I'll make a "graphical" description:
Assembly1  CATCGATCGTACGTAGCTAGCTAGCTAGCTCGTACGTACGTACG
                    |||||||||||||||||||||||||||||||
Assembly2           TACGTAGCTAGCTAGCTAGCTCGTACGTACGTACGATCGATCGCT
Contigs.fasta -> TACGTAGCTAGCTAGCTAGCTCGTACGTACG
Assembly1 -> CATCGATCGTACGTAGCTAGCTAGCTAGCTCGTACGTACGTACG
Assembly2 -> TACGTAGCTAGCTAGCTAGCTCGTACGTACGTACGATCGATCGCT
The aligned portion will be the contig in the Contigs.fasta.
Is that right? F
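The toy example can be checked with a few lines of Python. This is only an illustration of the idea (the shared piece occurs in both assemblies, each time at some position and in some orientation), not how CAMSA or nucmer locate contigs; exact string matching and the helper names are assumptions made for the sketch.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def locate(contig, assembly):
    """Return (0-based position, strand) of an exact occurrence of the contig, or None."""
    pos = assembly.find(contig)
    if pos != -1:
        return pos, "+"
    pos = assembly.find(reverse_complement(contig))
    if pos != -1:
        return pos, "-"
    return None


# Sequences copied from the toy example above.
assembly1 = "CATCGATCGTACGTAGCTAGCTAGCTAGCTCGTACGTACGTACG"
assembly2 = "TACGTAGCTAGCTAGCTAGCTCGTACGTACGTACGATCGATCGCT"
contig = "TACGTAGCTAGCTAGCTAGCTCGTACGTACG"

for label, assembly in (("Assembly1", assembly1), ("Assembly2", assembly2)):
    print(label, locate(contig, assembly))
```

Real assemblies will differ slightly in the shared regions, which is why an aligner with a matching threshold is used instead of exact search.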
That is correct! Then, with such "alignments" in contigs.fasta, the regular CAMSA pipeline will be feasible.
Not trivial, considering that there could be differences in the aligned portion. How would you do that? With a reciprocal blast search I would find homologous contigs, but the actual alignment? Nucmer? F
So I don't think you need the actual alignment positions of the common contigs, just the contigs themselves (the alignment positions will be identified by the CAMSA fasta2camsa_points.py pipeline).
If you can get the homologous contig sequences into a separate contigs.fasta file, that should be enough. Moreover, they do not need to match the sequences in the assemblies identically: in fasta2camsa_points.py, the input contigs.fasta sequences are searched for in the assemblies with nucmer using a given matching threshold, so small inconsistencies should be fine.
Also, I would suggest using only relatively long contigs in the contigs.fasta file, as CAMSA takes only the single best alignment of every sequence in contigs.fasta when it is aligned to a given assembly (i.e., it assumes that every contig in contigs.fasta appears at most once in a given assembly).
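As a rough illustration of the "relatively long contigs only" advice, a contigs file can be filtered by length before it goes into the pipeline. This is a sketch in plain Python; the function names are made up for the example, and the cutoff is left to the caller because the thread does not state a recommended value.

```python
def iter_fasta(text):
    """Yield (header_line, sequence) pairs from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_long_contigs(fasta_text, min_length):
    """Return FASTA text containing only records with at least min_length bases."""
    return "".join(
        f"{header}\n{seq}\n"
        for header, seq in iter_fasta(fasta_text)
        if len(seq) >= min_length
    )
```

Short pieces are also the ones most likely to occur more than once in an assembly, which conflicts with the single-best-alignment assumption described above.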
Hi @aganezov,
I'm back on CAMSA, and back trying to convert my fasta files into camsa_points. This is the command I'm running.
fasta2camsa_points.py $ASSEMBLY.fasta $ASSEMBLY.LRScaf/scaffolds.fasta EisaPacBio.assembly.v1.0.AllMapScaff.fasta -o $ASSEMBLY.CAMSA.LRScaf
Unfortunately, I get the same error during nucmer. I installed the new version, v4.0.0.beta2, and changed nucmer-cli-arguments in fasta2camsa_points.ini, adding the multithreading option implemented in the new version, -t 32. The error that I get is std::bad_alloc. Any idea how to solve this?
Thanks F
Hello @francicco, thanks for posting this!
The std::bad_alloc is most likely coming from MUMmer itself, not from the CAMSA utils. Can you please check that MUMmer works fine on its own?
Sergey.
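One way to run that check from Python is sketched below: call nucmer directly on a pair of FASTA files, outside of CAMSA, and look at its exit code and stderr. The file names and the prefix are hypothetical, and which file is passed as reference and which as query is an assumption of this sketch rather than something taken from the CAMSA source. The only nucmer option used is -p (output prefix).

```python
import shutil
import subprocess


def nucmer_command(reference, query, prefix="nucmer_check"):
    """Build a plain nucmer call: nucmer -p <prefix> <reference> <query>."""
    return ["nucmer", "-p", prefix, str(reference), str(query)]


def run_nucmer_check(reference, query, prefix="nucmer_check"):
    """Run nucmer on its own.

    Returns None if nucmer is not on PATH, otherwise a tuple
    (exit_code, saw_bad_alloc) where saw_bad_alloc tells whether
    "std::bad_alloc" appeared in nucmer's stderr.
    """
    if shutil.which("nucmer") is None:
        return None
    result = subprocess.run(
        nucmer_command(reference, query, prefix),
        capture_output=True,
        text=True,
    )
    return result.returncode, "std::bad_alloc" in result.stderr
```

If the same std::bad_alloc shows up here, the problem is reproducible without CAMSA, and trying a smaller input is a reasonable next step.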
So, I tried with just a bit of the genome and the nucmer step works fine; now delta-filter is running. Six hours and it's still going. Very, very slow, and it's just 26 Mb. Totally impractical for an insect genome. Is there anything that can be done to make it faster?
Thanks F
Hello @francicco,
So, alignment indeed takes quite a while. It's a very computationally intensive process, and, while MUMmer is well-designed software, there is nothing I can do to make it run faster.
With respect to the std::bad_alloc error that you've seen: I sincerely doubt that it comes from CAMSA itself (it refers to a RAM allocation problem; CAMSA is written in Python, which handles memory management automatically, and it doesn't take much memory in general). It more likely comes from the MUMmer alignment step that CAMSA was running.
Please, let me know if you have a small example where the error occurs, so I can investigate some more.
Hi, I'm relatively new to bioinformatics so hopefully I can get some help regarding this issue. I don't have a common set of contigs so I'm trying to create a pairwise alignment and then extract the local alignments into a new file. Is there a script that can help me do this? Or are there any tools that I can use to do this?
Hello @daneshnedaie, is your question related to the issue described above? If not, I would suggest (for the future) that you open a separate issue with your question.
With regard to your question: how many assemblies do you have? I'm not sure about a ready-to-use script, but your line of thinking is correct. Do a pairwise alignment, then use the aligned parts as contigs, and then use the standard CAMSA pipeline:
Please change 'nucmer-cli-arguments:-maxmatch -c 100' to 'nucmer-cli-arguments:--maxmatch -c 100' in the config file (camsa/utils/fasta/fasta2camsa_points.ini).
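For an installed copy that still has the old value, the same change can be applied programmatically. A small sketch follows; the function name is made up, and it works on the file contents as text so it makes no assumption about the rest of the .ini layout. It only rewrites a single-dash -maxmatch and leaves an already corrected --maxmatch alone.

```python
import re


def fix_maxmatch(ini_text):
    """Replace the single-dash option -maxmatch with --maxmatch.

    The negative lookbehind skips occurrences that already have two dashes,
    so running the function twice does not add extra dashes.
    """
    return re.sub(r"(?<!-)-maxmatch\b", "--maxmatch", ini_text)


old_line = "nucmer-cli-arguments:-maxmatch -c 100"
print(fix_maxmatch(old_line))
```

To apply it to the file, read camsa/utils/fasta/fasta2camsa_points.ini into a string, pass it through `fix_maxmatch`, and write the result back.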
I am also getting the same error. I tried @git4waki 's method, but the same error persists.
I've corrected the nucmer-cli-arguments value in the settings file and pushed to the master branch.
Hi,
I am also having the same error, but I am a little bit confused about the input arguments. We used Supernova to generate two haplotypes of the same genome from the same input reads, and we processed the two haplotypes up to gap filling and polishing. Now we would like to merge the two haplotypes to "boost" the assembly quality. From what I understand from the thread, my haplotype1.fa and haplotype2.fa are the two assemblies given as input, but for the "contigs.fa" input, should I convert the raw reads used to generate the two haplotypes to FASTA format and use them as my contigs.fa input?
Hi @aganezov,
I'm trying to merge two assemblies produced with different Supernova settings. Since I have FASTA files, I think I have to convert them into points-formatted files by running fasta2camsa_points.py. This is how I'm executing the command:
(camsa_env) [fc464@login-e-12 Pdid.Merge10xChromium]$ fasta2camsa_points.py ID1090_1_renamed_bcfrac066_pseudohap_1kb.fasta ID1090_1_new_ALL_pseudohap_1kb.fasta -o Pdid.10xMerge
And this is the error that I get back.
What am I doing wrong? Best, Francesco