bonsai-team / matam

Mapping-Assisted Targeted-Assembly for Metagenomics
GNU Affero General Public License v3.0

sort: unrecognized option '--parallel' #56

Closed zhssakura closed 6 years ago

zhssakura commented 6 years ago

Hi, when I run matam with the test dataset (16sp.art_HS25_pe_100bp_50x.fq), I get the following errors. Do you have any idea why this happens? The commands I used are provided below. Thanks, Shan

Commands:

module load python/3.5.2

python3 /home/z5095298/anaconda3/bin/matam_assembly.py -d /srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/SILVA_128_SSURef_NR95 -i /home/z5095298/anaconda3/examples/16sp_simulated_dataset/16sp.art_HS25_pe_100bp_50x.fq -o /srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/example_matam_assembly --cpu 14 --max_memory 10000 -v

error message:

INFO - === Alignment filtering ===
sort: unrecognized option '--parallel'
Try `sort --help' for more information.
INFO - Good alignments filtering terminated in 0.0896 seconds wall time

...

TIME: Reference fasta file read in 0.78 seconds.
INFO: 76956 reference sequences were loaded

TIME: References names loaded from the SAM file in 0 seconds.
INFO: 0 references are present in the SAM file

TIME: SAM file reading finished in 0 seconds.
INFO: 0 bam records were read, representing 0 reads
INFO: 0 bam record were mapped on a reference, representing 0 mapped reads

INFO - Overlap-graph building terminated in 0.8222 seconds wall time
INFO - Overlap graph stats: 0 nodes, -1 edges

INFO - === Graph compaction & Components identification ===

--- Loading ---
Source or Target label absent from the edges csv file
0 nodes loaded

--- Run algorithms ---
-> BFS...
-> Contraction...
0 nodes in the contracted graph
0 edges in the contracted graph
-> Absorb fingers...
0 nodes in the contracted graph
0 edges in the contracted graph

--- Filtering ---
-> Filtering nodes...
0 nodes in the filtered graph
-> Filtering edges...
0 edges in the filtered graph

--- Splicing ---

--- Saving results ---
-> Saving components
-> Saving meta graph

--- Program ended ---

INFO - Graph compaction & Components identification terminated in 0.0079 seconds wall time
Traceback (most recent call last):
  File "/home/z5095298/anaconda3/bin/matam_assembly.py", line 1714, in

loic-couderc commented 6 years ago

Hi @zhssakura,

The --parallel option has been available in GNU sort since coreutils 8.6 (released 2010-10-15).
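The version requirement above can also be checked from a script. The following is a minimal sketch, not part of MATAM: the function names are hypothetical, and it assumes the first line of `sort --version` has the form `sort (GNU coreutils) X.Y`, as in the outputs quoted in this thread.

```python
import re
import subprocess

# Assumption: --parallel first appeared in coreutils 8.6, as stated above.
MIN_PARALLEL_VERSION = (8, 6)


def parse_sort_version(version_output):
    """Extract (major, minor) from the first line of `sort --version`."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"\(GNU coreutils\)\s+(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError("not a GNU coreutils sort: %r" % first_line)
    return int(match.group(1)), int(match.group(2))


def supports_parallel(version_output):
    """True if the reported coreutils version is at least 8.6."""
    # Tuples compare numerically, so (8, 28) is correctly newer than (8, 6).
    return parse_sort_version(version_output) >= MIN_PARALLEL_VERSION


def installed_sort_supports_parallel():
    """Run the `sort` found on PATH and check its version."""
    result = subprocess.run(
        ["sort", "--version"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return supports_parallel(result.stdout)
```

Comparing version tuples rather than strings matters here: as plain text, "8.28" would sort before "8.6".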

A simple solution could be to upgrade to a newer version of coreutils. Assuming that you manage your packages with apt:

sudo apt-get install --only-upgrade coreutils

or use the one from anaconda:

conda install -c bioconda coreutils 

Could you provide us some more info on your operating system and the version of coreutils you are using?

sort --version

zhssakura commented 6 years ago

Hi, thanks for the reply! I checked the GNU coreutils version with the command 'sort --version'. The output was:

sort (GNU coreutils) 8.4
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

This problem has now been solved by our IT service. The version is now:

sort (GNU coreutils) 8.28
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Since MATAM is installed on our server, I changed my commands as below:

module load python/3.5.2
module load java/6u45
module load gcc/4.9.2
module load sparsehash/2.0.3
module load matam/1.4.0

matam_assembly.py -d /srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/SILVA_128_SSURef_NR95 -i /srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/examples/16sp_simulated_dataset/16sp.art_HS25_pe_100bp_50x.fq --cpu 14 --max_memory 10000 -v -o /srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/example_matam_assembly

However I got another error:

INFO - === Contigs assembly ===
INFO - Save components to fastq files
INFO - Assemble components
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/share/apps/python/3.5.2/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/share/apps/python/3.5.2/lib/python3.5/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/share/apps/matam/1.4.0/scripts/components_assembly.py", line 159, in assemble_component
    estimated_cov = estimate_coverage(in_fastq, fasta_file)
  File "/share/apps/matam/1.4.0/scripts/components_assembly.py", line 132, in estimate_coverage
    contigs_nt = nucleotidic_number(contigs_fa)
  File "/share/apps/matam/1.4.0/scripts/components_assembly.py", line 120, in nucleotidic_number
    with open(fastx, 'r') as fastx_handle:
FileNotFoundError: [Errno 2] No such file or directory: '/srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/example_matam_assembly/workdir/components_assembly/component26_reads_assembly_wkdir/assembly.fasta'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/share/apps/matam/1.4.0/bin/matam_assembly.py", line 1714, in <module>
    exit_code = main()
  File "/share/apps/matam/1.4.0/bin/matam_assembly.py", line 1217, in main
    args.cpu, args.read_correction, args.contig_coverage_threshold)
  File "/share/apps/matam/1.4.0/scripts/components_assembly.py", line 228, in assemble_all_components
    fasta_list = pool.starmap(assemble_component, params)
  File "/share/apps/python/3.5.2/lib/python3.5/multiprocessing/pool.py", line 268, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/share/apps/python/3.5.2/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/example_matam_assembly/workdir/components_assembly/component26_reads_assembly_wkdir/assembly.fasta'

Could you please help me with this? Thank you very much!

Best,

Shan

loic-couderc commented 6 years ago

Hi @zhssakura,

From the log above, I suspect that something went wrong with SGA. In a regular run, we expect the following files in the /srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/example_matam_assembly/workdir/components_assembly/component26_reads_assembly_wkdir directory:

As assembly.fasta seems to be missing in your case, could you provide the assembly.log so that we can check this? If nothing went wrong with SGA, could you run MATAM in debug mode and send us the matam.log file? Use the same command as above with these extra parameters:

--debug > /pathto/matam.log 2>&1
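To see which components are affected, the component work directories can be scanned for a missing assembly.fasta. This is a minimal sketch rather than MATAM code: the function name is hypothetical, and the `component*_reads_assembly_wkdir` naming pattern is an assumption inferred from the single path shown in the traceback.

```python
from pathlib import Path


def components_missing_assembly(components_dir):
    """Return component work directories that contain no assembly.fasta."""
    root = Path(components_dir)
    missing = []
    # Assumption: every component directory is named like the one in the
    # traceback, e.g. component26_reads_assembly_wkdir.
    for wkdir in sorted(root.glob("component*_reads_assembly_wkdir")):
        if wkdir.is_dir() and not (wkdir / "assembly.fasta").is_file():
            missing.append(wkdir)
    return missing
```

Calling it on the `workdir/components_assembly` directory of a run returns the directories whose assembly.log is worth inspecting first.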

Thank you.

zhssakura commented 6 years ago

Hi, thanks! Here are the files in the /srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/example_matam_assembly/workdir/components_assembly/component26_reads_assembly_wkdir directory: assembly.log

And here is the matam.log from debug mode: matam.log

Thank you very much!

loic-couderc commented 6 years ago

There is a problem with SGA, which cannot find the libbamtools.so.2.4.1 shared library. To help us reproduce this bug, can you provide the version of your operating system? For example, on Ubuntu:

lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.2 LTS
Release:    16.04
Codename:   xenial

Or can you ask your IT service to retrieve this information?
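A missing shared library such as the one mentioned above can usually be confirmed with `ldd`, which prints `=> not found` for every library the dynamic linker cannot resolve. The sketch below is only illustrative: the function names are hypothetical, and the path to the SGA executable is left to the caller because the thread does not give it.

```python
import subprocess


def missing_shared_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # The library name is the text before the '=>' arrow.
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(binary_path):
    """Run ldd on an executable and list its unresolved libraries."""
    # Assumption: a Linux system where the `ldd` command is available.
    result = subprocess.run(
        ["ldd", binary_path],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return missing_shared_libraries(result.stdout)
```

If `check_binary` returns a non-empty list for the SGA executable, the installation is incomplete or the library directory is not on the linker's search path.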

zhssakura commented 6 years ago

Hi, Here is the information of our server:

LSB Version: :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 6.3 (Final)
Release: 6.3
Codename: Final

Thank you!

zhssakura commented 6 years ago

Hi loic-couderc, do you have any suggestions for addressing the problem with SGA? I also ran MATAM with the conda commands again after the GNU sort problem was solved. There is another problem, shown in the attached matam.log file: matam.log

The result files in the directory 'workdir' are listed below:

16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.blast
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.fq
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.log
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.sam
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.asqg
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.cpts_N1_E1.SGA_by_component.fasta
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.cpts_N1_E1.component_lca51pct.tab
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.cpts_N1_E1.components.csv
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.cpts_N1_E1.metaEdges.csv
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.cpts_N1_E1.metaNodes.csv
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.cpts_N1_E1.read_metanode_component.tab
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.cpts_N1_E1.read_metanode_component_taxo.tab
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.edges.csv
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.ovgb_i100_o50.nodes.csv
16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.sam
components_assembly
contigs.NR.fasta
contigs.NR.min_500bp.fasta
contigs.fasta
contigs.sortmerna_vs_complete_SILVA_128_SSURef_NR95_num_align_0.blast
contigs.sortmerna_vs_complete_SILVA_128_SSURef_NR95_num_align_0.blast.best_only.selected.sam
contigs.sortmerna_vs_complete_SILVA_128_SSURef_NR95_num_align_0.blast.best_only.selected.tab
contigs.sortmerna_vs_complete_SILVA_128_SSURef_NR95_num_align_0.blast.best_only.tab
contigs.sortmerna_vs_complete_SILVA_128_SSURef_NR95_num_align_0.sam
scaffolds.fa

Here are the contig results: contigs.NR.min_500bp.fasta.zip. Is the result correct? If it is, can I run MATAM on my own data despite these errors?

Any suggestions would be helpful. Thank you. Shan

loic-couderc commented 6 years ago

Hi @zhssakura,

From the first command line, I can see that you were using MATAM from the conda package. Then, from the log file, I can see that you are using MATAM compiled from source. Am I right?

If so, an error must have occurred when compiling MATAM with the build.py script. A regular installation should finish with something like this:

2017-09-18 10:09:07,300 - INFO - -- MATAM building complete --
2017-09-18 10:09:07,300 - DEBUG - Building completed in 377.46 seconds
2017-09-18 10:09:07,300 - DEBUG - MATAM building went well. Program executables can be found in MATAM bin directory: ...

As I have no issues installing MATAM from the conda package on CentOS 6.3, I strongly recommend that you use MATAM from conda, as in your last comment.

Then when running this command:

/home/z5095298/anaconda3/bin/matam_assembly.py -d /srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/SILVA_128_SSURef_NR95 -i /home/z5095298/anaconda3/examples/16sp_simulated_dataset/16sp.art_HS25_pe_100bp_50x.fq -o /tmp/matam_output --cpu 14 --max_memory 10000 -v --debug --perform_taxonomic_assignment > matam.log 2>&1

the final log in debug mode should look like this: matam.log, and end with:

matam_assembly.py terminated with no error
2017-11-07 11:42:22,328 - INFO - Run terminated in 3851.6521 seconds wall time

The final directory will contain several files. The reconstructed markers will be found in the final_assembly.fa file (the contigs files are intermediate results).

final_assembly.fa -> /tmp/matam_output/workdir/scaffolds.NR.min_500bp.abd.fa
krona.html -> /tmp/matam_output/workdir/scaffolds.NR.min_500bp.abd.rdp.krona.html
krona.tab -> /tmp/matam_output/workdir/scaffolds.NR.min_500bp.abd.rdp.krona.tab
workdir

Be warned that without the --perform_taxonomic_assignment flag, the final_assembly.fa link will be missing. This is a known issue (#51), already fixed but not yet released.
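The two success criteria described above (the closing log line and the final_assembly.fa link) can be checked after a run. This is a rough sketch, not part of MATAM: the function name is hypothetical, and it assumes the log was captured with the `> matam.log 2>&1` redirection shown in the command.

```python
from pathlib import Path

# Closing message quoted from the example log above.
SUCCESS_MARKER = "matam_assembly.py terminated with no error"


def run_looks_complete(output_dir, log_path):
    """Return (log_reports_success, final_assembly_present)."""
    log_text = Path(log_path).read_text(errors="replace")
    finished = SUCCESS_MARKER in log_text
    # exists() follows symlinks, so a dangling final_assembly.fa link
    # is reported as missing.
    has_assembly = (Path(output_dir) / "final_assembly.fa").exists()
    return finished, has_assembly
```

A result of `(True, False)` would be consistent with the known issue when `--perform_taxonomic_assignment` is omitted, whereas `(False, False)` points to a run that stopped early.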

loic-couderc commented 6 years ago

Hi @zhssakura, Is your problem solved?

zhssakura commented 6 years ago

Hi @loic-couderc, MATAM now seems to run correctly with the conda commands. However, there are still some warnings that I cannot get rid of or explain, such as:

...
INFO - === Reads mapping against /srv/scratch/z5095298/Julie/SPONGES_FOR_TT/13_MATAM/ST_matam_assembly_fq/workdir/abundance_ie3vr6fs/idx_prefix ===

WARNING - M01153:79:000000000-AL19U:1:1108:9472:2684 is mapped more than once on the same scaffold ({'38': 2}) but it will contribute to the abundance of this scaffold only as 1 weight where weight=1/uniq_scaffolds_nb=1

WARNING - M01153:80:000000000-AL12U:1:1104:13880:15567 is mapped more than once on the same scaffold ({'6': 2, '5': 2}) but it will contribute to the abundance of this scaffold only as 1 weight where weight=1/uniq_scaffolds_nb=0.5

WARNING - M01153:79:000000000-AL19U:1:1115:9378:7283 is mapped more than once on the same scaffold ({'42': 2}) but it will contribute to the abundance of this scaffold only as 1 weight where weight=1/uniq_scaffolds_nb=1

WARNING - M01153:80:000000000-AL12U:1:1108:22969:6213 is mapped more than once on the same scaffold ({'6': 2, '5': 2}) but it will contribute to the abundance of this scaffold only as 1 weight where weight=1/uniq_scaffolds_nb=0.5
...

Also, in the file 'scaffolds.NR.min_500bp.abd.fa', there can be multiple scaffolds belonging to the same 16S sequence, with high identity in the overlapping parts. Is this a common situation?

Thank you very much for your help.

Shan

loic-couderc commented 6 years ago

Hi @zhssakura,

Since we do not expect repeats in 16S rRNA, these warnings relate to a behaviour of SortMeRNA, which on rare occasions can map a read more than once to the same position on a reference. That is why such a read is counted only once for that scaffold.
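The weighting quoted in the warnings (weight = 1 / uniq_scaffolds_nb) can be illustrated with a small sketch. It is reconstructed only from the warning text and is not MATAM's actual implementation; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def scaffold_abundance(read_hits):
    """Sum per-scaffold abundance from per-read alignment counts.

    read_hits maps read_id -> {scaffold_id: number_of_alignments},
    the same shape as the dictionaries printed in the warnings.
    """
    abundance = defaultdict(float)
    for hits in read_hits.values():
        if not hits:
            continue
        # Duplicate alignments on one scaffold are ignored: only the
        # number of distinct scaffolds determines the weight.
        weight = 1.0 / len(hits)
        for scaffold_id in hits:
            abundance[scaffold_id] += weight
    return dict(abundance)
```

With the dictionaries from the first two warnings, a read with `{'38': 2}` adds 1 to scaffold 38, while a read with `{'6': 2, '5': 2}` adds 0.5 to each of scaffolds 6 and 5.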

It is possible for different scaffolds to overlap each other imperfectly; in that case, we assume these sequences come from different species.