AnantharamanLab / METABOLIC

A scalable high-throughput metabolic and biogeochemical functional trait profiler

METABOLIC-C for Nanopore reads (Unpaired) #55

Closed roytangent closed 2 years ago

roytangent commented 2 years ago

Hi! For context, I have 15 marine samples, each sequenced and basecalled with an ONT MinION. I intend to use METABOLIC-C.pl since I have community data. However, it seems the script requires .fasta files and a list of paired reads. The problem is that the MinION output is in .fastq format and the reads are not paired. Any idea how I can run the script on my dataset?

ChaoLab commented 2 years ago

Hi Roy, I can implement a flag in METABOLIC-C for handling long reads. I will keep you posted.

ChaoLab commented 2 years ago

Hi @roytangent, I have updated the METABOLIC-C script to take long reads as input (https://github.com/AnantharamanLab/METABOLIC/wiki/METABOLIC-Usage#all-required-and-optional-flags). You will need to add the "-st" (or "-sequencing-type") flag to indicate your sequencing type (for your case, it will be "-st nanopore"). Please also note the input format for long reads, as indicated in the 3rd point (at the bottom of https://github.com/AnantharamanLab/METABOLIC/wiki/METABOLIC-Usage#all-required-and-optional-flags). Should you have any more issues, I'd be glad to hear from you.
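For example, an invocation on long reads would look something like this (the paths here are placeholders, and long_reads_list.txt is a hypothetical reads list following the format described in the 3rd point of the wiki page above):

# Sketch of a METABOLIC-C run on Nanopore long reads; adjust paths to your data
perl METABOLIC-C.pl -in-gn Your_genome_folder -r long_reads_list.txt -st nanopore -t 20 -o METABOLIC_out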

roytangent commented 2 years ago

That's awesome, will try out the new feature. Thank you @ChaoLab !

roytangent commented 2 years ago

Hi @ChaoLab ,

Trying out the flag with the command below:

perl METABOLIC_run/METABOLIC/METABOLIC-C.pl -in-gn 01_individual_reads -r sample01.txt -st nanopore -tax class -t 20 -o sample01

(sample01.txt contains the file paths to the reads for my first sample.) However, the script seems to look for .faa files and the output is erroneous. Am I doing something wrong?

ChaoLab commented 2 years ago

Can you paste the log file? In addition, minimap2 should be installed as a dependency. The "-in-gn" flag expects a folder containing genome files (genomic DNA) in fasta format.
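For example, a quick check from the conda environment METABOLIC runs in (just a sketch):

# Confirm minimap2 is installed and on PATH
which minimap2 && minimap2 --version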

roytangent commented 2 years ago

My log looks like this:

METABOLIC-C.pl v4.0
Run Start: 2021-11-27 11:07:25
Run End: 2021-11-27 11:07:36
Total running time: 00:00:11 (hh:mm:ss)
Input Reads: sample01.txt
Reads type: metaG
Sequencing type: nanopore
Input Genome directory (nucleotides): 01_individual_reads/
Number of Threads: 20
Prodigal Method: meta
KOfam DB: full
Module Cutoff Value: 0.75
Taxonomic level to calculate MW-score table: class
Output directory: sample01

I see! The output after basecalling from nanopore sequencing is only in .fastq format. Do you think I should convert it to .fasta instead? Additionally, I have concatenated all reads into a single .fastq file per sample in my read directory, so "01_individual_reads" looks like this: sample01.fastq, sample02.fastq, sample03.fastq. If I want to run the analysis on each individual sample, do I still have to provide a text file via the -r flag?
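In case conversion turns out to be the way to go, this is the kind of thing I had in mind (just a sketch, assuming seqtk is available):

# Convert each basecalled FASTQ to FASTA with seqtk
for fq in 01_individual_reads/*.fastq; do
    seqtk seq -a "$fq" > "${fq%.fastq}.fasta"
done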

cheers!

ChaoLab commented 2 years ago

Hi @roytangent, if I understand you correctly, I think there is a misunderstanding about the input genomes required by the "-in-gn" flag. The input genomes should be metagenome-assembled genomes (MAGs) generated by a binning tool (metaWRAP, MetaBAT, MaxBin, and so on). The "-in-gn" flag takes genome files (genomic DNA files, in fasta format).
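For example, the folder passed to "-in-gn" would typically look something like this (hypothetical bin names from a binning run):

# One fasta file per MAG produced by your binning tool
$ ls MAG_folder/
bin.1.fasta  bin.2.fasta  bin.3.fasta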

roytangent commented 2 years ago

I see, thanks for the clarification. I have supplied the genomes from my previous binning run and managed to get the script running. I am able to produce the nutrient cycling diagram for each bin. However, the Sankey diagrams and the community plots come out blank. On checking the log, here is what I got:

[2021-11-28 16:14:52] INFO: GTDB-Tk v1.6.0
[2021-11-28 16:15:27] INFO: Completed 40 genomes in 0.46 seconds (87.11 genomes/second).
[2021-11-28 16:15:27] INFO: Done.
[2021-11-28 16:15:27] INFO: Aligning markers in 40 genomes with 80 CPUs.
[2021-11-28 16:15:27] INFO: Processing 37 genomes identified as bacterial.
[2021-11-28 16:15:32] INFO: Read concatenated alignment for 45,555 GTDB genomes.
[2021-11-28 16:15:32] TASK: Generating concatenated alignment for each marker.
[2021-11-28 16:15:38] ERROR: Uncontrolled exit resulting from an unexpected error.

EXCEPTION: BlockingIOError
MESSAGE: [Errno 11] Resource temporarily unavailable

Traceback (most recent call last):
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/main.py", line 95, in main
    gt_parser.parse_options(args)
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/main.py", line 735, in parse_options
    self.align(options)
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/main.py", line 291, in align
    markers.align(options.identify_dir,
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/markers.py", line 516, in align
    user_msa = align.align_marker_set(cur_genome_files, marker_info_file, copy_number_f, self.cpus)
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/pipeline/align.py", line 219, in align_marker_set
    single_copy_hits = get_single_copy_hits(gid_dict, copy_number_file, cpus)
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/pipeline/align.py", line 78, in get_single_copy_hits
    with mp.get_context('spawn').Pool(processes=cpus) as pool:
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 58, in _launch
    self.pid = util.spawnv_passfds(spawn.get_executable(),
  File "/home/groy2/miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/util.py", line 452, in spawnv_passfds
    return _posixsubprocess.fork_exec(
BlockingIOError: [Errno 11] Resource temporarily unavailable

I am using release 202 for GTDB-Tk. Any way to resolve this error? Thank you so much.

ChaoLab commented 2 years ago

@roytangent It seems that something is wrong with your GTDB-Tk. You could first try running GTDB-Tk on its own as a test.
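For example, a standalone test on the same bins would be something like this (a sketch; adjust the paths and extension to your setup):

# Run GTDB-Tk classify_wf directly on the 40 bins
gtdbtk classify_wf --genome_dir path/to/your_bins --out_dir gtdbtk_test --extension fasta --cpus 20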

roytangent commented 2 years ago

GTDB-Tk ran fine after the install test and on their provided test data. I tried running the entire script again but hit the same error. I'm not sure, but could it have something to do with /miniconda3/envs/METABOLIC_v4.0/lib/python3.8/multiprocessing/?

ChaoLab commented 2 years ago

@roytangent That is possible. However, in most cases Python itself is not the problem if it was installed correctly when setting up the conda environment. Did you run GTDB-Tk on your 40 genomes directly? Did you check the format of each fasta file (the header line of each sequence, Unix rather than DOS line endings, etc.)? In my experience, many problems come down to small format issues.

roytangent commented 2 years ago

@ChaoLab It seems the problem only arises above a certain number of threads set with the -t flag. With the default, the script ran perfectly. Thank you!
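For anyone who hits the same "[Errno 11] Resource temporarily unavailable" at high thread counts, it is often worth checking the per-user resource limits on the node, for example (a sketch):

# Limits that commonly cause fork/spawn failures like Errno 11
ulimit -u    # maximum number of user processes
ulimit -v    # maximum virtual memory (kB)
nproc        # CPU cores actually available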