EBI-COMMUNITY / ebi-parasite

GNU General Public License v3.0

Update assembly.py to run SPAdes as the default assembly software #2

Open nimapak opened 6 years ago

nimapak commented 6 years ago

Please update assembly.py to run SPAdes as the default assembly software. As with the quality control software, the user shall be able to specify the assembly software. The default is SPAdes (http://bioinf.spbau.ru/spades), and the only other supported software shall be Velvet. The input data shall be either one fastq file or two fastq files.

The script needs to take a prefix to add to all the file names. For example, if the user wishes to have ERR1234 as the prefix, all the files created by the script shall start with ERR1234_

Please also use the name of the assembly program in all the file names, so the naming structure for the above example would be: ERR1234velvet or ERR1234spades

Please add this prefix option to the quality control script as well, and include the name of the quality control software in the file names.
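A minimal sketch of the command-line interface this request describes, using Python's argparse. The flag names (`--assembly-software`, `--prefix`), the positional `fastq` argument, and the helper names are assumptions made for illustration; the thread does not say how assembly.py actually exposes these options.

```python
import argparse

# Only these two assemblers are requested in this issue.
ASSEMBLERS = ("spades", "velvet")


def build_parser():
    """Build a parser with hypothetical option names for assembly.py."""
    parser = argparse.ArgumentParser(
        description="Assemble reads with SPAdes (default) or Velvet.")
    parser.add_argument(
        "-a", "--assembly-software", choices=ASSEMBLERS, default="spades",
        help="assembly software to run (default: spades)")
    parser.add_argument(
        "-p", "--prefix", required=True,
        help="prefix for all created file names, e.g. ERR1234")
    parser.add_argument(
        "fastq", nargs="+",
        help="one fastq file (single) or two fastq files (paired)")
    return parser


def parse_args(argv=None):
    """Parse arguments and reject anything other than one or two fastq files."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if len(args.fastq) > 2:
        # parser.error prints a usage message and exits with status 2.
        parser.error("expected one or two fastq files")
    return args
```

With `choices`, argparse itself prints an error and exits when an unsupported assembler is given, so no separate validation branch is needed for that case.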

xinliu005 commented 6 years ago

After running SPAdes on a single fastq file and on paired fastq files, the following temporary dirs and output files were produced:

temporary dirs: corrected, K21, K33, K55, K77, tmp, misc

output files: input_dataset.yaml, params.txt, dataset.info, before_rr.fasta, contigs.fasta, scaffolds.fasta, scaffolds.paths, assembly_graph_with_scaffolds.gfa, assembly_graph.fastg, contigs.paths, warnings.log, spades.log

They can be found in:
/nfs/production/seqdb/embl/developer/xin/new_eclipse_dir/working_dir/parasite_genome_analysis/assembly/single_fastq
/nfs/production/seqdb/embl/developer/xin/new_eclipse_dir/working_dir/parasite_genome_analysis/assembly/paired_fastq

Should I investigate each output file in detail and then decide which ones need to be kept and given the prefix?

xinliu005 commented 6 years ago

scaffolds.fasta – resulting scaffolds (recommended for use as resulting sequences)
contigs.fasta – resulting contigs
assembly_graph.fastg – assembly graph
contigs.paths – contigs paths in the assembly graph
scaffolds.paths – scaffolds paths in the assembly graph
before_rr.fasta – contigs before repeat resolution

corrected/ – files from read error correction
    configs/ – configuration files for read error correction
    corrected.yaml – internal configuration file
    Output files with corrected reads

params.txt – information about SPAdes parameters in this run
spades.log – SPAdes log
dataset.info – internal configuration file
input_dataset.yaml – input fastq file full path, read orientation, type (single or paired)
assembly_graph_with_scaffolds.gfa – assembly graph in GFA format
warnings.log – all warnings

Suggestion: all files can be kept, but only scaffolds.fasta will be given the prefix and used by the next step.

xinliu005 commented 6 years ago

assembly.py:
1) arguments
   a) assembly_software: only spades or velvet are permitted; otherwise print an error and exit. If not provided, spades is used as the default.
   b) prefix: add prefix+assembly_software+run_type (single or paired) to the file name, so the file names will look like ERR1234_spadessingle*
2) command
   single: spades -s fq1 --careful --cov-cutoff auto -o output_dir
   paired: spades -1 fq1 -2 fq2 --careful --cov-cutoff auto -o output_dir
3) output: all output files from SPAdes are written to working_dir, and then scaffolds.fasta and spades.log are copied to out_dir with new names.
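The command construction and naming described in this comment can be sketched as below. This is an illustration under stated assumptions, not the committed code: the function names are hypothetical, the executable name is a parameter (this comment writes `spades`, while SPAdes installs its launcher as `spades.py`), and joining the name parts with underscores is a guess, because the exact separators in the example above are unclear.

```python
def build_spades_command(fastqs, output_dir, executable="spades.py"):
    """Return the SPAdes argument list for one (single) or two (paired) fastq files."""
    if len(fastqs) == 1:
        reads = ["-s", fastqs[0]]
    elif len(fastqs) == 2:
        reads = ["-1", fastqs[0], "-2", fastqs[1]]
    else:
        raise ValueError("expected one or two fastq files")
    # Options taken from this comment: --careful --cov-cutoff auto -o output_dir
    return [executable] + reads + [
        "--careful", "--cov-cutoff", "auto", "-o", output_dir]


def output_stem(prefix, software, fastqs):
    """Combine prefix, software name and run type (underscore separator is assumed)."""
    run_type = "single" if len(fastqs) == 1 else "paired"
    return "_".join([prefix, software, run_type])
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file paths.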

quality_control.py:
1) arguments
   a) prefix: add prefix+qc_software+run_type (single or paired) to the file name, so the file names will look like ERR1234_trim_galoresingle*

utilities.py: two new methods were added to class fileutils:

    class fileutils(object):
        def add_file_prefix(self, source_fpath, prefix):
        def copy_file_add_prefix(self, source_fpath, outdir, prefix):
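Only the signatures of the two methods are given here, so the following is a guess at what they might do rather than the committed implementation. It assumes that `add_file_prefix` returns the prefixed path without touching the file system, that `copy_file_add_prefix` copies the file into `outdir` under the prefixed name, and that prefix and file name are joined with an underscore.

```python
import os
import shutil


class fileutils(object):
    def add_file_prefix(self, source_fpath, prefix):
        """Return source_fpath with prefix + '_' prepended to the file name (assumed behavior)."""
        directory, fname = os.path.split(source_fpath)
        return os.path.join(directory, prefix + "_" + fname)

    def copy_file_add_prefix(self, source_fpath, outdir, prefix):
        """Copy source_fpath into outdir under the prefixed name and return the new path."""
        if not os.path.isdir(outdir):
            os.makedirs(outdir)
        fname = os.path.basename(source_fpath)
        dest = os.path.join(outdir, prefix + "_" + fname)
        shutil.copy(source_fpath, dest)
        return dest
```

Under these assumptions, copying scaffolds.fasta with the prefix ERR1234_spades_single would produce ERR1234_spades_single_scaffolds.fasta in the output directory.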

assembly.py, quality_control.py, and utilities.py were tested and will be committed.

xinliu005 commented 6 years ago

assembly.py, quality_control.py, and utilities.py were successfully committed.