ganlab / GALA

Long-reads Gap-free Chromosome-scale Assembler
MIT License
68 stars 17 forks source link

Single command error, then issues with individual commands. #1

Closed Dikaryotic closed 4 years ago

Dikaryotic commented 4 years ago

Hiya,

I saw the preprint on bioRxiv and thought I'd give GALA a go. I've ran into two issues though pretty right off the bat.

I've two different genome assemblies listed in drafts.txt, with full file pathways and formatted as per the github instructions.

The first issue was when I run the following command: ./gala drafts.txt fa cns_final.fasta.gz nanorepore-corrected

I get this error: Traceback (most recent call last): File "./gala", line 51, in comp_generator(genomes=draft_names,output=workdir) File "/scale_wlg_nobackup/filesets/nobackup/landcare00072/Genome/Programs/GALA/src/comp_generator.py", line 11, in comp_generator z=list(open(genomes)) IOError: [Errno 2] No such file or directory: 'drafts.txt'

Apparently it can't find drafts.txt even though it's in the working directory, as all the other files are! Any thoughts?

Following that I tried running the steps individually and ran into an error at the mdm step.

./comp drafts.txt worked fine, as did sh draft_comp.sh, however when I ran ./mdm comparison/ 2 I get this issue:

./mdm comparison/ 2 Traceback (most recent call last): File "./mdm", line 14, in from cut_gathering import cut_gathering ImportError: No module named cut_gathering

Cheers, Chris

mawad89 commented 4 years ago

Thanks, Chris for reporting these bugs; You can use the last commit now.

But, mdm can't give correct results from only 2 assemblies. mdm module requires at least 3 drafts to determine mis-assembled regions.

Dikaryotic commented 4 years ago

I've just had a go with the latest commit and three genome assemblies, I've run into this for the first error:

./gala drafts.txt fa cns_final.fasta.gz nanopore-corrected Traceback (most recent call last): File "./gala", line 54, in os.chdir(new_dir) OSError: [Errno 20] Not a directory: '/scale_wlg_nobackup/filesets/nobackup/landcare00072/Genome/Programs/GALA/gala'

I then went and ran through the manual steps and got a little further this time, but hit this error at the last step of the MDM module:

./newgenome drafts.txt gathering/ Traceback (most recent call last): File "./newgenome", line 28, in genomes(genomes=draft,gathering=gathering,gathering_name=name,outpath=output) File "/scale_wlg_nobackup/filesets/nobackup/landcare00072/Genome/Programs/GALA/src/new_genome.py", line 82, in genomes b=new_genome(cut_file=gathering+gatheringname+''+a+'_cuts.txt',old_genome=aa,outpath=outpath,name='new'+a) File "/scale_wlg_nobackup/filesets/nobackup/landcare00072/Genome/Programs/GALA/src/new_genome.py", line 11, in new_genome a=list(open(cut_file)) IOError: [Errno 2] No such file or directory: 'gathering/new_genome_draft_1_cuts.txt'

mawad89 commented 4 years ago

I will submit a new commit for single command very soon For manual steps you can run the below command ./newgenome drafts.txt gathering/ -f gathering

Also before running the CCM module you need to change the name of mdm comparison directory. otherwise it will write the CCM files in the same folder.

Ural-Yunusbaev commented 4 years ago

Hey there! I already have 3 assemblies from different assemblers as in the following input file: cat draft_names_paths.txt draft_1=~/Assembly/Ra_assembly_Pilon_polished.fasta draft_2=~/Assembly/Flye_assembly_Pilon_polished.fasta draft_3=~/Assembly/NexDenovo_assembly_NextPolish_polished.fasta The drafts are corrected. These assemblies are from non-corrected ont_reads: ont_reads=~/Assembly/porechop.fq I am going to run the gala in a single command mode like this: gala draft_names_paths.txt fq $ont_reads -nanopore -a flye -o SingleCommandMode Also, I would like to skip the preliminary step because my assemblies are already done. Is my command correct for the above task? Is -nanopore for draft_names_paths.txt or for $ont_reads? Thanks in advance!

Cheers, Ural Yunusabev

mawad89 commented 4 years ago

Hey Ural Just change -nanopore to nanopore-raw and be sure that SingleCommandMode folder exists.

Ural-Yunusbaev commented 4 years ago

gala draft_names_paths.txt fq $ont_reads -nanopore-raw -a flye -o SingleCommandMode reported: Traceback (most recent call last): File "/home/crciv/soft/gala/gala", line 54, in <module> os.chdir(new_dir) OSError: [Errno 2] No such file or directory: 'Acer_SingleCommandMode/gala' I checked it. The SingleCommandMode folder exists. ls -l drwxr-xr-x 3 crciv crciv 26 May 18 20:55 Acer_SingleCommandMode Then I ran: gala draft_names_paths.txt fq $ont_reads -nanopore-raw -a flye It is running well. It looks like somthing wrong here: -o SingleCommandMode

Ural-Yunusbaev commented 4 years ago

Another bug! '[M::main] CMD: minimap2 -x asm5 /home/crciv/AcerChrAssemb/GALA/gala/new_genomes/new_draft_3.fa /home/crciv/AcerChrAssemb/GALA/gala/new_genomes/new_draft_2.fa [M::main] Real time: 16.140 sec; CPU: 39.736 sec; Peak RSS: 1.634 GB Traceback (most recent call last): File "/home/crciv/soft/gala/gala", line 82, in <module> scaffolding(path='comparison',number_of_drafts=number_of_drafts,block=block,percentage=percent,shortage_contig=contig,quality=qty,out_file=True,output_name=name,output=outpath) File "/home/crciv/soft/gala/src/scaffolding.py", line 163, in scaffolding for bas in a[e][ba]: KeyError: 'Ctg8_pilon_1_awad'

mawad89 commented 4 years ago

gala draft_names_paths.txt fq $ont_reads -nanopore-raw -a flye -o SingleCommandMode reported: Traceback (most recent call last): File "/home/crciv/soft/gala/gala", line 54, in <module> os.chdir(new_dir) OSError: [Errno 2] No such file or directory: 'Acer_SingleCommandMode/gala' I checked it. The SingleCommandMode folder exists. ls -l drwxr-xr-x 3 crciv crciv 26 May 18 20:55 Acer_SingleCommandMode Then I ran: gala draft_names_paths.txt fq $ont_reads -nanopore-raw -a flye It is running well. It looks like somthing wrong here: -o SingleCommandMode

because the exists folder not SingleCommandMode but Acer_SingleCommandMode/gala

mawad89 commented 4 years ago

Another bug! '[M::main] CMD: minimap2 -x asm5 /home/crciv/AcerChrAssemb/GALA/gala/new_genomes/new_draft_3.fa /home/crciv/AcerChrAssemb/GALA/gala/new_genomes/new_draft_2.fa [M::main] Real time: 16.140 sec; CPU: 39.736 sec; Peak RSS: 1.634 GB Traceback (most recent call last): File "/home/crciv/soft/gala/gala", line 82, in <module> scaffolding(path='comparison',number_of_drafts=number_of_drafts,block=block,percentage=percent,shortage_contig=contig,quality=qty,out_file=True,output_name=name,output=outpath) File "/home/crciv/soft/gala/src/scaffolding.py", line 163, in scaffolding for bas in a[e][ba]: KeyError: 'Ctg8_pilon_1_awad'

If you can share the files in gala_results/gap_free_comp/comparison folder with me (I tried different data-sets but I don't see this error)

Dikaryotic commented 4 years ago

Hiya,

I've had a go with the new commit and got further through with the SingleCommandMode this time. It's hit an error in the LGAM module I think. I'm getting this error, which looks to be something to do with header formatting leading to downstream issues.

[main] Version: 0.7.17-r1188 [main] CMD: bwa index new_draft_3.fa [main] Real time: 0.006 sec; CPU: 0.003 sec [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 1430 sequences (10040726 bp)... samtools sort: failed to read header from "-" [main_samview] fail to read the header from "-". samtools index: "mapping/new_draft_1/mapping.bam" is in a format that cannot be usefully indexed [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 1430 sequences (10040726 bp)... samtools sort: failed to read header from "-" [main_samview] fail to read the header from "-". samtools index: "mapping/new_draft_2/mapping.bam" is in a format that cannot be usefully indexed [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 1430 sequences (10040726 bp)... samtools sort: failed to read header from "-" [main_samview] fail to read the header from "-". samtools index: "mapping/new_draft_3/mapping.bam" is in a format that cannot be usefully indexed [main_samview] fail to read the header from "mapping.bam". [main_samview] fail to read the header from "mapping.bam". [main_samview] fail to read the header from "mapping.bam". [E::hts_open_format] Failed to open file "" : No such file or directory samtools view: failed to open "" for reading: No such file or directory [E::hts_open_format] Failed to open file "" : No such file or directory samtools view: failed to open "" for reading: No such file or directory [E::hts_open_format] Failed to open file "" : No such file or directory samtools view: failed to open "" for reading: No such file or directory gala 1.0.0

mawad89 commented 4 years ago

Hi Chris, The LGAM module steps is the easiest but it need some time and computational resources. It looks for me there are a problem in new genome indexing using bwa. So, firstly if you can check if gala_results/new_genome directory contains the three new fa files. and gala_results/gap_free_comp/gathering directory contains the scaffolding files .scaff

Then you can follow the LGAM module from manual steps: starting from genome indexing then mapping .... for indexing you can go to new_genomes directory and run the following command for i in *fa; do bwa index $i; done; then map each of them using: bwa mem -x ont2d new_draft_1.fa long-reads|samtools sort|samtools view -Sb > draft1.bam bwa mem -x ont2d new_draft_2.fa long-reads|samtools sort|samtools view -Sb > draft2.bam bwa mem -x ont2d new_draft_3.fa long-reads|samtools sort|samtools view -Sb > draft3.bam Also you can try minimap2 for mapping instead of bwa
then follow the rest of steps. However, I will be with you step by step just tell me if you face any other problem.

Dikaryotic commented 4 years ago

Hiya,

Thanks for all the help! I checked that those files existed, and they did. So I went and tried to run the bwa mem mapping as you suggested but hit the same problem. I eventually thought to check the contents of those files only to find them empty. I backtracked some and discovered that somehow my original genome input files were also empty, which is pretty bizarre as they were direct copies from the original assembly files. I've gone and re-copied them across and now the pipeline seems to be running properly. It's currently at the bwa mapping step, so fingers crossed this is the last I bother you!

mawad89 commented 4 years ago

Thanks Chris, for your information. The good part it is running now (:. I will keep this issue open until you finish the run

Dikaryotic commented 4 years ago

Hiya,

It managed to run all the way through to the final assembly step, then an error cropped up with my Canu assembling. I use Canu 2.0 and I suspect that may be the problem. Could you possibly give some detailed instructions for that final assembly step? Mostly I'm wanting to know what read files I should be assembling and how to process them at the end. I'm guessing I should be assembling each scaffold_.read.fq in the mapping/newdraft folders?

Also, what are the chances of writing in an option to restart the analysis at this step? It'd be good to be able to run the bulk through and then test the final assembly step with different assemblers as youve suggested.

Cheers, Chris

Dikaryotic commented 4 years ago

Hi again,

I've had some time to dig a bit deeper. I had a go manually assembling the scaffold_.read.fq files with both Canu and Flye and encounter the same errors as before. I took a look at the fq files and it looks like each sequence has a single extra quality score character more than the actual sequence.

mawad89 commented 4 years ago

Hi Chris Try the new commit with the below command

for i in gala_results/mapping//read_names; do ./readsep cns_final.fasta.gz $i -f fa;done;

I will work next days on resume option

Dikaryotic commented 4 years ago

Heya,

I had a go with the new commit. It's outputting the correct number of false quality scores, but it looks like the @ isn't being added to the first line of the fastq output. Otherwise I think it looks ok. I've included an example output sequence for reference.

ontsa_id660951(0_2168_2160_2168_19_1987.065861)(0_2160_2153_2160_18_15_96.541814) GTCTGAGATAAGTCCCGGTCCCGACTTGAGGTGGCTTAGGTCAGCGCCAGTTCGGAGCCTATCTTCTCGCATCCTTTCGCCACCCATCCACCCGCACCTTTTCCAGCTGTATGCCACCGCTTTAAGTTTCCGCGTGTGCAGGCATGTTCTGCTCTCCCACACATTTTTCCCACCCGCGGGGCTCAAGCCCAGCCACCCATTAACACATATTATCTTATATCTTACACCCTTATTAGACCCATAAGTCTCAGCCAACACAGTTGTCGACCTGACATTTCTGCACGTCCGATGAGGTAATTACTAAAGCCGAAACGATTCAAAGTTGCGATATGATATCCAATGAACAGGTTTCAACAAACAAAACAGCGATTCCATTTCCAAAACGGACAAGTACGAGTAGGTTCAACTTCAATGAATATGGATATTGAGTTCAAGTATTCCGTTTATGAGCATTTTCGCCTTCGATTCAGAGAGTGACAGGGACACATCCGATAGAGACCAATCCACCCTGAAAGGTCGACTAAGAGTTCATACTTAAGTCATAGGAAATCCGAAAATAGCTTACCCAAGCATTATCGTCGCAGCATACGGCCTGGGCAGAGCAGCCACTGTTTCCAACGCCGATGACGGTGATGGGGGAGCAGGTGACACCAAGCAAAACGTCGACATCTTGAACAACGACTCCGATGAGACCGAGGAGAAGGGAACCAGCAGCGGAACCAGCCTGTGAGTAGGTCAAACAAGGAGACGTCAGCAAAATGCTTTGTTGGGTGTTGCAAGAAATCTTTGATCTTACAGTCTCAGTGCTGTCGCAGCATTGAAGCTCACCAGTGCTGCAACTACTGGCGGGTTCGGTCGCAGTAGTCCCGGGCTATATACGAGAAAAGATAGCGGTGGTGGTTCAGCATCATGATAAGTGATCAGCAACGACTCGAGCACTAATCGCCAATCGCTTACCGCAGTGACCGTAATCGTGGTGGTGATGGGTTGATTTCCCGGGGTGGCGGCAGCAAGGATAACAAGCGCGGACAAGGCGTAGAAAGTAACGGCGGAGACGCGAGAGAGACATCTTTACTTGAGAGCTTAGAGTGTATCTGTATCTGAAAATCTCGAGGGAAGTTCGATGCTTCGACTCTTGCTCTTTGACCAAGCTTTTATACCCAAATTTCGGTAATGACCGTGAGGTCAGGGCTCTGCTCCTCAAAGATATGTCAGAAAACGCCGCCGCACAAATGTCAGATGCGGGCACTGGTACAGGCTGAACAGGAAAATAAAATTGTTATTCATCGATGCACGCTTGAGCATCGGTTCTGAAATTTGAGCAGGCGGTTGAAGTGAAATCCGTGTGGCGCCGTATCTATTGCCAGAGGAAAAAAGACGGGAAATATGAGAATGAACACTTGATAGAATGCGTATGAAAGTGAAGTCATTAACACTTTAGGGGACAAAAGGAATACCTGAACAGCACGTAACATGCATGATGCATTCAGCGGAGCGGGAAAATACCAAGATTGACATGATTTACCAGTCAGAAAAGAAAAGAAAAAATCAATGACTACTGCACCCGACCACGCAGTCTCATTAGTACATTGCTGCAAAGAAGTTAAAAGCTATACTGTATCTCCAAAGACAGGTGTCGGTTGTTGATTTGTTCCTGTACCAGCGACTCGGGGTTTTCGGAAATGGTCTCAGAAAACCGCGAGGACAGGGATACGGGTATCGAAAAAAACCAGGTTTTGGTTGACCCCTGAACAACAGCGCACAGAAATGTAGGGAGAACGGGTCCAAACCCATTGAACGACAAATCGAGAAGCCTCTAGGTAAGCTCTAGGTTTCTATGATTGGTTCCGGATGGTTCCTTACAGTCTGATCTCTATGACTAGTCGAGTGAAAGCTGTCAATGGATCGTGATGCTTAGTATGAGTATGTAGCTCAACCAAACATGAAATTTACGACGTAGGTGGAGTTCTTCTGGGAAAGACGGTAAAAACTTGGCGATTTTCCCCTCCTTTGAGACAATACATGACTGCCCAAAGAGTCAAGCATTTAGTTTCGGTGACATGAGGAGCTAATAGGGACTAGGAATTCATTGACATCCTGATCAGCTGTCAGGCGAGAGGAATACATATGAAGTCAAGAAGTACTTACCTCAT+ BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

mawad89 commented 4 years ago

Thanks again; The fastq file looks have 2 problems @ at the first and + included in sequence line so each read will have only 3 lines not 4. However I solved it in the new commit. You can use the last command again with the new commit

If you want to save some time you can use the following command to solve the current fastq files In case all your reads start with the pattern ontsa

for i in gala_results/mapping//fq; do cat $i | sed -e 's/+/\n+/g'|sed -e 's/ontsa/@ontsa/g'> $i.a && mv $i.a $i ;done;

Dikaryotic commented 4 years ago

Heya,

That's worked well :) I was wondering if you had advice/a protocol for this final step:

Finally, map the LGAM outcomes against one of the preliminary draft assemblies to confirm that all the contigs in the linkage group are assembled to the right chromosome/Scaffold.

I can do the mapping easily enough

Ural-Yunusbaev commented 4 years ago

Hi Chris, bravo! You are lucky! I failed there. Can you share the commands that you used? Thanks! Cheers, Ural Yunusbaev

Dikaryotic commented 4 years ago

Hmm, well I'd used the single command option which ran all the way through to the final LGM step. So currently I have a couple of folders that have the input data split into linkage groups, which I'm currently assembling manually.

If you havn't already, I'd recommend getting the newest commit and try that.

Ural-Yunusbaev commented 4 years ago

Yes, I have tried the updated single command option according to Readme. But something is wrong with folder temporary names or folder option usage. I have tried different folder names and options, but it reports an error. Could you update the Readme according to the exact commands that you use in the single command option? It would be great if you share a toy file and exact command to run the single command option with the toy file. Cheers, Ural Yunusbaev

mawad89 commented 4 years ago

Hi Ural Yunusbaev I am preparing this toy file now. once I finish it I will share it with you

Best wishes