compbiol / CAMSA

CAMSA: a tool for Comparative Analysis and Merging of Scaffold Assemblies

fasta2camsa_points - ERROR #15

Open francicco opened 5 years ago

francicco commented 5 years ago

Hi @aganezov,

I'm trying to merge two assemblies produced by Supernova with different settings. Since I have FASTA files, I think I have to convert them to the points format by running fasta2camsa_points.py.

This is the command I'm running and the error it returns:

(camsa_env) [fc464@login-e-12 Pdid.Merge10xChromium]$ fasta2camsa_points.py ID1090_1_renamed_bcfrac066_pseudohap_1kb.fasta ID1090_1_new_ALL_pseudohap_1kb.fasta -o Pdid.10xMerge
================================================================================
| Sergey Aganezov & Max A. Alekseyev (c)                                       |
| Computational Biology Institute, The George Washington University            |
|                                                                              |
| Converting FASTA formatted scaffolding results for further CAMSA processing. |
|                                                                              |
| For more information refer to github.com/compbiol/camsa/wiki                 |
| With any questions, please, contact Sergey Aganezov [aganezov(at)cs.jhu.edu] |
================================================================================

Command Line Args:   ID1090_1_renamed_bcfrac066_pseudohap_1kb.fasta ID1090_1_new_ALL_pseudohap_1kb.fasta -o Pdid.10xMerge
Config File (/home/fc464/software/CAMSA/camsa/logging.ini):
  c-logging-level:   20
  c-logging-formatter-entry:%(asctime)s - %(name)-15s - %(levelname)-7s - %(message)s
Config File (/home/fc464/software/CAMSA/camsa/utils/fasta/fasta2camsa_points.ini):
  c-cov-threshold:   90.0
  c-coords-pairs-strategy:mid-point-sort
  nucmer-cli-arguments:-maxmatch -c 100
  nucmer-path:       nucmer
  show-coords-cli-arguments:-r -c -l
  show-coords-path:  show-coords
  delta-filter-cli-arguments:-r -q
  delta-filter-path: delta-filter
Defaults:
  --ensure-all:      False

2018-11-28 10:04:53,346 - CAMSA.utils.fasta2camsa_points - INFO    - Starting the converting process
2018-11-28 10:04:53,352 - CAMSA.utils.fasta2camsa_points - INFO    - Working with "ID1090_1_new_ALL_pseudohap_1kb.fasta"
2018-11-28 10:04:53,352 - CAMSA.utils.fasta2camsa_points - INFO    - Running NUCmer for "ID1090_1_new_ALL_pseudohap_1kb.fasta" scaffolds file, using "ID1090_1_renamed_bcfrac066_pseudohap_1kb.fasta" as query. This might take time.
2018-11-28 10:04:53,352 - CAMSA.utils.fasta2camsa_points - INFO    -    nucmer -maxmatch -c 100 -p /rds/project/shm37/rds-shm37-helixmbodyw/HeliconiiniProg/Pdid.Merge10xChromium/Pdid.10xMerge/fasta2camsa/ID1090_1_new_ALL_pseudohap_1kb ID1090_1_new_ALL_pseudohap_1kb.fasta ID1090_1_renamed_bcfrac066_pseudohap_1kb.fasta > /rds/project/shm37/rds-shm37-helixmbodyw/HeliconiiniProg/Pdid.Merge10xChromium/Pdid.10xMerge/fasta2camsa/logs/nucmer_ID1090_1_new_ALL_pseudohap_1kb.stdout.txt 2> /rds/project/shm37/rds-shm37-helixmbodyw/HeliconiiniProg/Pdid.Merge10xChromium/Pdid.10xMerge/fasta2camsa/logs/nucmer_ID1090_1_new_ALL_pseudohap_1kb.stderr.txt
2018-11-28 10:07:24,688 - CAMSA.utils.fasta2camsa_points - ERROR   - NUCmer exited with non-zero code, running for "ID1090_1_new_ALL_pseudohap_1kb.fasta" scaffolds file.
2018-11-28 10:07:24,688 - CAMSA.utils.fasta2camsa_points - ERROR   - NUCmer logs are stored in:
2018-11-28 10:07:24,688 - CAMSA.utils.fasta2camsa_points - ERROR   -    stdout: "/rds/project/shm37/rds-shm37-helixmbodyw/HeliconiiniProg/Pdid.Merge10xChromium/Pdid.10xMerge/fasta2camsa/logs/nucmer_ID1090_1_new_ALL_pseudohap_1kb.stdout.txt"
2018-11-28 10:07:24,688 - CAMSA.utils.fasta2camsa_points - ERROR   -    stderr: "/rds/project/shm37/rds-shm37-helixmbodyw/HeliconiiniProg/Pdid.Merge10xChromium/Pdid.10xMerge/fasta2camsa/logs/nucmer_ID1090_1_new_ALL_pseudohap_1kb.stderr.txt"
2018-11-28 10:07:24,690 - CAMSA.utils.fasta2camsa_points - ERROR   - Delta file for prefix="ID1090_1_new_ALL_pseudohap_1kb" was not found in the output folder.
2018-11-28 10:07:24,690 - CAMSA.utils.fasta2camsa_points - INFO    - Elapsed time: 0:02:31.344933

What am I doing wrong? Best, Francesco

jokelley commented 5 years ago

We are also having this issue, thanks!

aganezov commented 5 years ago

Hello, and thank you for your interest in CAMSA!

First things first: please make sure you are running the latest version of CAMSA.

Now to the problem. First, if you are trying to merge two distinct (scaffold) assemblies, please be aware that CAMSA needs to know the set of contigs that both assemblies were built from. Please correct me if I'm wrong, but it seems that the two (de novo) assemblies you are trying to merge were not generated from the same set of contigs (which CAMSA requires). But I think there is a workaround.

When running fasta2camsa_points.py, the idea is that the first argument is the contigs.fasta file, while the second and subsequent arguments are assembly_n.fasta files. nucmer is invoked on each pair of files (contigs.fasta + assembly_n.fasta), aligning the contigs in contigs.fasta to the (scaffold) assembly in each assembly_n.fasta file and extracting the information about order and orientation (i.e., the assembly points) that CAMSA then operates on.

If you do not have the same set of contigs that both of your assemblies are built with, you can try first extracting the pairwise alignments of the two assemblies into a separate FASTA file, contigs.fasta, and then running fasta2camsa_points.py as follows:

fasta2camsa_points.py contigs.fasta assembly1.fasta assembly2.fasta -o output

Now, regarding the error in your run: it seems that nucmer, for some reason, did not finish the alignment. Can you please provide the log of the run, so I can get a better grip on the error?

francicco commented 5 years ago

It looks like I solved the problem: apparently nucmer runs out of memory, and you have to recompile it. @jokelley, google the actual error from the nucmer error log file; I can't find the compile command now. The problem comes afterward: it takes ages. F

aganezov commented 5 years ago

@francicco Great to hear that the problem went away.

Please also note the part of my response w.r.t. CAMSA expectations about the input assemblies, contigs, etc.

And let me know if I can be of any further help!

francicco commented 5 years ago

@aganezov,

The idea I have, though I honestly don't know if it is correct, is to merge two assemblies produced from the same set of reads but with different strategies. But what you're suggesting is not very clear to me: "If you do not have the same set of contigs that both of your assemblies are built with..."

F

aganezov commented 5 years ago

@francicco Yeah, I believe I understand your situation.

CAMSA was primarily designed to merge multiple scaffold assemblies, where every scaffold assembly is built on the same set of contigs. Imagine having contigs.fasta produced by an assembler, like SPAdes, and then scaffolding those contigs with different sources of information: (i) a jumping library (assembly1.fasta), (ii) a long-read-based scaffolder (assembly2.fasta), and (iii) optical mapping (assembly3.fasta). Using CAMSA, one can then compare and merge all of these scaffold assemblies at the same time. To do so, a conversion from FASTA to the CAMSA format is needed first, which is where the following command comes in:

fasta2camsa_points.py contigs.fasta assembly1.fasta assembly2.fasta assembly3.fasta -o output

Now, in your case, I believe you do not have a common set of contigs (i.e., building blocks) available for both assemblies, and CAMSA requires one. One way to get there, I think, is to first align your two assemblies to each other pairwise, and then extract every local alignment into a contigs.fasta file. Both of your assemblies could then be viewed as orders of oriented pieces from contigs.fasta, and you'll be able to use CAMSA. In a way, your assemblies would become scaffolds on the common set of contigs.

Please, let me know if this is helpful, and I'd be glad to help you in more detail if needed.
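
The extraction step described above could be sketched in Python roughly as follows. This is only an illustration and not part of CAMSA. It assumes the two assemblies were already aligned with nucmer and the alignments exported with `show-coords -T -H` (tab-separated, no header, reference start and end in the first two columns, reference sequence ID in the second-to-last column). The function names and the `min_len` cutoff are hypothetical, and overlapping or repetitive alignments would still need to be filtered out beforehand.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {sequence_id: sequence}."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the ID, as MUMmer does.
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {key: "".join(chunks) for key, chunks in seqs.items()}


def extract_blocks(coords_lines, reference, min_len=1000):
    """Return [(block_id, sequence)] for every alignment of at least min_len bp.

    ASSUMPTION: each line has the `show-coords -T -H` layout, i.e.
    S1, E1, S2, E2, LEN1, LEN2, %IDY, ..., ref_id, query_id (1-based, inclusive).
    """
    blocks = []
    for line in coords_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        start, end = sorted((int(fields[0]), int(fields[1])))
        if end - start + 1 < min_len:
            continue  # ignore short local alignments
        ref_id = fields[-2]
        blocks.append((f"contig_{len(blocks) + 1}", reference[ref_id][start - 1:end]))
    return blocks


def to_fasta(blocks, width=80):
    """Format (name, sequence) pairs as FASTA text."""
    out = []
    for name, seq in blocks:
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Toy input: one scaffold and one alignment covering positions 10..44.
    assembly1 = read_fasta([">scaffold1", "CATCGATCGTACGTAGCTAGCTAGCTAGCTCGTACGTACGTACG"])
    coords = ["10\t44\t1\t35\t35\t35\t100.00\tscaffold1\tscaffoldA"]
    print(to_fasta(extract_blocks(coords, assembly1, min_len=20)), end="")
```

The resulting text would be saved as contigs.fasta and passed as the first argument to fasta2camsa_points.py, followed by the two assemblies.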

francicco commented 5 years ago

Makes sense now; let's see if I understand it. I'll make a "graphical" description:

Assembly1         CATCGATCGTACGTAGCTAGCTAGCTAGCTCGTACGTACGTACG
                           |||||||||||||||||||||||||||||||
Assembly2                  TACGTAGCTAGCTAGCTAGCTCGTACGTACGTACGATCGATCGCT

Contigs.fasta -> TACGTAGCTAGCTAGCTAGCTCGTACGTACG
Assembly1 -> CATCGATCGTACGTAGCTAGCTAGCTAGCTCGTACGTACGTACG
Assembly2 -> TACGTAGCTAGCTAGCTAGCTCGTACGTACGTACGATCGATCGCT

The aligned portion will be the contig in the Contigs.fasta.

Is that right? F

aganezov commented 5 years ago

That is correct! Then, with such "alignments" in contigs.fasta, the regular CAMSA pipeline will be feasible.

francicco commented 5 years ago

Not trivial, considering that there could be differences in the aligned portion. How would you do that? With a reciprocal blast search I would find homologous contigs, but the actual alignment? Nucmer? F

aganezov commented 5 years ago

So I don't think you need the actual alignment positions of the common contigs, just the contigs themselves (the alignment positions will be identified by the CAMSA fasta2camsa_points.py pipeline).

If you can get the homologous contig sequences into a separate contigs.fasta file, that should be enough. Moreover, they do not need to match the sequences in the assemblies identically: fasta2camsa_points.py uses nucmer to search for the contigs.fasta sequences in the assemblies with a given matching threshold, so small inconsistencies should be fine.

Also, I would suggest using only relatively long contigs in the contigs.fasta file, as CAMSA takes only the single best alignment of every sequence in contigs.fasta when it is aligned to a given assembly (i.e., it assumes that every contig in contigs.fasta appears at most once in a given assembly).
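
Restricting contigs.fasta to relatively long sequences could be sketched as below. The function names are hypothetical, and no specific length cutoff is suggested in this thread, so `min_len` is left to the caller.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_long_contigs(lines, min_len):
    """Return only the (header, sequence) records with at least min_len bases."""
    return [(header, seq) for header, seq in iter_fasta(lines) if len(seq) >= min_len]


if __name__ == "__main__":
    # Toy input; in practice the lines would come from open("contigs.fasta").
    records = keep_long_contigs([">a", "ACGT", "ACGT", ">b", "AC"], min_len=5)
    for header, seq in records:
        print(f">{header}\n{seq}")
```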

francicco commented 5 years ago

Hi @aganezov,

I'm back on CAMSA, and back trying to convert my fasta files into camsa_points. This is the command I'm running.

fasta2camsa_points.py $ASSEMBLY.fasta $ASSEMBLY.LRScaf/scaffolds.fasta EisaPacBio.assembly.v1.0.AllMapScaff.fasta -o $ASSEMBLY.CAMSA.LRScaf

Unfortunately, I get the same error during nucmer. I installed the new MUMmer version (v4.0.0.beta2) and changed nucmer-cli-arguments in fasta2camsa_points.ini, adding the multithreading option implemented in the new version, -t 32. The error that I get is std::bad_alloc. Any idea how to solve this?

Thanks F

aganezov commented 5 years ago

Hello @francicco, thanks for posting this!

The std::bad_alloc is most likely coming from MUMmer itself, not from the CAMSA utilities. Can you please check that MUMmer works fine on its own?

Sergey.
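
One way to run that check is to call nucmer directly with the same arguments that appear in the CAMSA log and inspect its exit code and stderr. The sketch below assumes that `nucmer` is on the PATH and uses the `--maxmatch -c 100` arguments from fasta2camsa_points.ini. The function names are hypothetical, and `-t` is only passed when a thread count is given, since it exists only in MUMmer 4.

```python
import shutil
import subprocess


def build_nucmer_command(reference, query, prefix, threads=None, nucmer="nucmer"):
    """Build the nucmer argument list: assembly as reference, contigs as query."""
    cmd = [nucmer, "--maxmatch", "-c", "100", "-p", prefix]
    if threads is not None:
        cmd += ["-t", str(threads)]  # multithreading flag, MUMmer 4 only
    return cmd + [reference, query]


def run_nucmer(reference, query, prefix, threads=None):
    """Run nucmer on its own and return (exit_code, stderr_text)."""
    cmd = build_nucmer_command(reference, query, prefix, threads)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr


if __name__ == "__main__":
    # Only prints the command; call run_nucmer(...) with real files to execute it.
    print(" ".join(build_nucmer_command("assembly1.fasta", "contigs.fasta", "nucmer_check")))
```

If the standalone run also fails with std::bad_alloc, the problem lies in the MUMmer installation or in the available memory rather than in CAMSA.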

francicco commented 5 years ago

So, I tried with just a bit of the genome and the nucmer step works fine; now delta-filter is running. Six hours and it's still going. Very, very slow, and it's just 26 Mb. Totally impractical for an insect genome. Is there anything that can be done to make it faster?

Thanks F

aganezov commented 5 years ago

Hello @francicco,

So, alignment indeed takes quite a while. It's a very computationally intensive process, and while MUMmer is well-designed software, there is nothing I can do to make it run faster. With respect to the std::bad_alloc error that you've seen: I sincerely doubt that it comes from CAMSA itself. It indicates a memory (RAM) allocation failure, and CAMSA is written in Python, which handles memory management automatically and doesn't use much memory in general. More likely, it was raised while CAMSA was running the MUMmer alignment step.

Please, let me know if you have a small example where the error occurs, so I can investigate some more.
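
A small test case could be cut from a large FASTA file with something like the sketch below. It keeps whole sequences from the start of the file until a base budget is reached. The function name and the budget are hypothetical, and the contigs file and the assembly would need to be subset consistently for the alignment to remain meaningful.

```python
def take_first_bases(lines, max_bases):
    """Return FASTA lines for the leading sequences of a file.

    A sequence is kept whole if fewer than max_bases bases were collected
    before its header, so the total can slightly exceed max_bases.
    """
    out, total, keep = [], 0, False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            keep = total < max_bases
            if keep:
                out.append(line)
        elif keep and line:
            out.append(line)
            total += len(line)
    return out


if __name__ == "__main__":
    # Toy input; in practice the lines would come from open("assembly.fasta").
    toy = [">s1", "ACGT", "ACGT", ">s2", "AC", ">s3", "GG"]
    print("\n".join(take_first_bases(toy, max_bases=9)))
```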

daneshnedaie commented 5 years ago

Hi, I'm relatively new to bioinformatics so hopefully I can get some help regarding this issue. I don't have a common set of contigs so I'm trying to create a pairwise alignment and then extract the local alignments into a new file. Is there a script that can help me do this? Or are there any tools that I can use to do this?

aganezov commented 5 years ago

Hello @daneshnedaie , is your question related to the issue described above? If not, I would suggest (for the future) that you open a separate issue with your question.

With regard to your question: how many assemblies do you have? I'm not sure about a ready-to-use script, but your line of thinking is correct. Do a pairwise alignment, then use the aligned parts as contigs, and then use the standard CAMSA pipeline, as outlined earlier in this thread.

git4waki commented 4 years ago

Please change 'nucmer-cli-arguments:-maxmatch -c 100' to 'nucmer-cli-arguments:--maxmatch -c 100' in Config File (camsa/utils/fasta/fasta2camsa_points.ini).
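
For an older checkout, the same change could be applied with a few lines of Python instead of a manual edit. This is only a sketch: the function name is hypothetical, and it assumes the setting sits on a single line that starts with `nucmer-cli-arguments`.

```python
import re


def fix_maxmatch(ini_text):
    """Rewrite '-maxmatch' to '--maxmatch' on the nucmer-cli-arguments line only."""
    fixed = []
    for line in ini_text.splitlines(keepends=True):
        if line.strip().startswith("nucmer-cli-arguments"):
            # The lookbehind leaves an already corrected '--maxmatch' untouched.
            line = re.sub(r"(?<!-)-maxmatch\b", "--maxmatch", line)
        fixed.append(line)
    return "".join(fixed)


if __name__ == "__main__":
    print(fix_maxmatch("nucmer-cli-arguments:-maxmatch -c 100\n"), end="")
```

Applying it twice is harmless, because a line that already contains `--maxmatch` is not changed.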

Adarsh931 commented 4 years ago

I am also getting the same error. I tried @git4waki 's method, but the same error persists.

aganezov commented 4 years ago

I've corrected the nucmer-cli-arguments value in the settings file and pushed to the master branch.

mas160 commented 4 years ago

Hi,

I am also having the same error, but I am a little confused about the input arguments. We used Supernova to generate two haplotypes of the same genome from the same input reads, and we processed the two haplotypes up to gap filling and polishing. Now we would like to merge the two haplotypes to "boost" the assembly quality. From what I understand from the thread, my haplotype1.fa and haplotype2.fa are the two assemblies given as input, but for the "contigs.fa" input, should I convert the raw reads used to generate the two haplotypes to FASTA format and use them as my contigs.fa?