HMPNK / CSA2.6

Chromosome Scale Assembler: A high-throughput chromosome scale genome assembly pipeline for vertebrate genomes
MIT License

CSA fails to go beyond step2 #5

Closed estolle closed 3 years ago

estolle commented 3 years ago

Hi there

I have been trying to get CSA running for a while, but it always fails at step 2. The test dataset works and runs through, though. It seems to have something to do with Ragout, but the log does not help to find out what the actual problem is. It always fails with the error "ERROR: Error reading permutations". Previously my problem was that I was using zsh as my shell, and it choked on the bash scripts CSA generated, specifically on the multi-line commands. Using bash now, the test dataset works, but my actual dataset still doesn't. The logs:

$ cat parallel.log
index file FASTLAST.fai not found, generating...
awk: /home/ek/progz/CSA2.6/INSTALL/../script/update_maf_coords.awk:14: warning: escape sequence `\.' treated as plain `.'

$ cat ragout.log
[12:19:10] INFO: Cooking Ragout...
[12:19:10] INFO: Reading FASTA with contigs
[12:19:13] INFO: Converting MAF to synteny
Parsing MAF file
Started initial compression
Simplification with 30 10
Simplification with 100 100
Simplification with 500 1000
Simplification with 1000 5000
Simplification with 5000 15000
[12:19:15] INFO: Running Ragout with the block size 160000
[12:19:15] ERROR: Error reading permutations

In the folder mafworkdir/160000 the files are empty; the other ones, up to 80000, are OK. I guess this is the problem, but I don't know how to fix it.

The contents of the files in this folder are:

$ cat blocks_coords.txt
Seq_id Size Description

$ cat coverage_report.txt
Seq_id Size Description


genomes_permutations.txt is empty.
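
A quick way to see which block sizes produced no synteny blocks is to scan the work directory for empty permutation files. The following is only a minimal sketch: it assumes one subdirectory per block size under mafworkdir (as with mafworkdir/160000 above), each holding a genomes_permutations.txt; the function name is made up for illustration.

```python
from pathlib import Path


def find_empty_block_dirs(workdir):
    """Return names of subdirectories whose genomes_permutations.txt is missing or empty.

    Assumption: workdir contains one subdirectory per block size
    (e.g. 5000, ..., 160000), as described in the report above.
    """
    empty = []
    # Sorted by directory name (plain string order) for a stable result.
    for block_dir in sorted(Path(workdir).iterdir(), key=lambda p: p.name):
        if not block_dir.is_dir():
            continue
        perm = block_dir / "genomes_permutations.txt"
        if not perm.exists() or perm.stat().st_size == 0:
            empty.append(block_dir.name)
    return empty
```

Calling find_empty_block_dirs("mafworkdir") on the run described here would be expected to list 160000, since that is the only folder reported as empty.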

Any ideas how to fix this?

estolle commented 3 years ago

I tried removing the error checks (pipefail) from the step 2 bash script and reran that script, but it shows me these errors:

Do 14. Jan 14:57:38 CET 2021 RUN RAGOUT TO ORDER CONTIGS ACCORDING TO REFERENCE

awk: fatal: cannot open file `./RAGOUT_Anthophora.plumipes.scaffolded.namefixed.fa/scaffolds.links' for reading (No such file or directory)

Do 14. Jan 15:00:07 CET 2021 RUN RAGOUT TO ORDER CONTIGS ACCORDING TO REFERENCE

awk: fatal: cannot open file `./RAGOUT_Anthophora.plumipes.fa/scaffolds.links' for reading (No such file or directory)
ln: failed to create symbolic link 'Anthophora_CSA1_out.step2.fa': File exists

Not sure whether I get these because I am rerunning the step manually.

I otherwise thought the problem is the 160000 blocksize which perhaps is not present in my data. I could not find the place where to deactivate that block size in the ragout scripts though

estolle commented 3 years ago

I still haven't been able to fix the issue. Another, larger dataset seems to work, and again, a subset (10 genomes) fails with the same error as described above. For the 2-genome dataset described above I managed to do the analysis using a Cactus alignment (HAL) and then Ragout 2. The result seems good, so in that case I can recommend it as an alternative. For the 10-genome dataset, however, I would like to use CSA for its long-read integration.

HMPNK commented 3 years ago

Hi,

Sorry for the late reply, I have not been here for a while. Could you give me the command line you used to generate the CSA bash files? So you say you have some data that work and some that don't? (What is the difference between these datatypes?)

HMPNK commented 3 years ago

Regarding "I otherwise thought the problem is the 160000 blocksize which perhaps is not present in my data. I could not find the place where to deactivate that block size in the ragout scripts though"

you could remove "160000," from the following scripts in the CSA script folder:

02_ORDERCONTIGS.pl: awk 'BEGIN{print \".tree = ($out:0.01,$queryname:0.01);\n.target = $out\n.maf = $m[-1].maf\n.blocks = 160000,80000,40000,20000,10000,5000\n\n$queryname.draft = true\n$out.fasta = $contigs\"}' > $m[-1].recipe.txt
04_ORDERCONTIGS.pl: awk 'BEGIN{print \".tree = ($out:0.01,$queryname:0.01);\n.target = $out\n.maf = $m[-1].maf\n.blocks = 160000,80000,40000,20000,10000,5000\n\n$queryname.draft = true\n$out.fasta = $contigs\"}' > $m[-1].recipe.txt
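
The edit suggested above can also be done programmatically. This is a hedged sketch rather than part of CSA: the helper name drop_block_size is hypothetical, and it only assumes the ".blocks = ..." list looks like the one quoted from the two scripts.

```python
import re


def drop_block_size(text, size):
    """Remove one size from every '.blocks = a,b,c' list found in text.

    Assumption: the list is a comma-separated run of integers without
    spaces, as in the lines quoted from 02_ORDERCONTIGS.pl and
    04_ORDERCONTIGS.pl.
    """
    def rewrite(match):
        # Keep every size except the one to be dropped.
        kept = [s for s in match.group(2).split(",") if s != str(size)]
        return match.group(1) + ",".join(kept)

    return re.sub(r"(\.blocks\s*=\s*)(\d+(?:,\d+)*)", rewrite, text)
```

One would read each script into a string, apply drop_block_size(text, 160000), and write the result back after keeping a backup of the original file. Text outside the ".blocks" list is left untouched.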

HMPNK commented 3 years ago

But anyway, it seems you have highly fragmented assemblies. I do not recommend using CSA if the contig N50 is less than 1 Mb.
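
To check an assembly against that 1 Mb guideline, the contig N50 can be computed directly from the FASTA file. The sketch below uses the common definition of N50 (the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total); the function names are illustrative and not part of CSA.

```python
def contig_lengths(fasta_path):
    """Return the sequence length of every record in a FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For example, n50(contig_lengths("assembly.fa")) >= 1_000_000 would indicate that the assembly meets the guideline; "assembly.fa" is a placeholder file name.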

HMPNK commented 3 years ago

I just found that the Ragout problem "ERROR: Error reading permutations" is related to the FASTA headers in the provided assembly file (option "-C"). When I used FASTA headers like ">R2_1:0-39461144" (as generated by bedtools), Ragout crashed in step2. When I changed the headers to a simple format like "scf1, scf2, ..., scfn" it worked. Hope this solves your problem.
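
The header renaming described above could be scripted along these lines. This is a minimal sketch under stated assumptions: the function name and the "scf" prefix default are illustrative, and the returned mapping is an addition so the original names can be restored later.

```python
def simplify_fasta_headers(in_path, out_path, prefix="scf"):
    """Rewrite FASTA headers as prefix1, prefix2, ... and return new -> old names.

    Sequence lines are copied unchanged; only lines starting with '>' are
    replaced, so headers such as '>R2_1:0-39461144' become '>scf1'.
    """
    mapping = {}
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                new_name = f"{prefix}{len(mapping) + 1}"
                mapping[new_name] = line[1:].strip()
                dst.write(f">{new_name}\n")
            else:
                dst.write(line)
    return mapping
```

The renamed file would then be the one passed to CSA with "-C", and the mapping can be saved to translate scaffold names back afterwards.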

estolle commented 3 years ago

Oh great, you tracked down the issue. I think I came across a similar problem with Ragout (when used separately). At least next time I can avoid it =) Thanks very much.