donia-lab / MetaBGC

A metagenomic strategy for harnessing the chemical repertoire of the human microbiome
GNU General Public License v3.0
32 stars 8 forks source link

Issue with metabgc build failing #13

Closed kellystyles closed 3 years ago

kellystyles commented 4 years ago

Hello, I have been trying to find spHMMs for my protein family of interest using metabgc build, but I haven't been able to successfully complete a metabgc build run. I have prepared synthetic metagenomes as in the supplementary of the metabgc paper (although I only prepared 10 each for high and low synthetic metagenomes for testing).

I have been running the software from a Singularity container with all the prerequisites installed as required in the documentation. The metabgc search function using this container works with the toy data, so I think it is an issue with the build function. Furthermore I have tried running metabgc build from metabgc installed on my university HPC to no avail. Again, the metabgc search function works with the toy data with this locally installed version of metabgc.

I'm not sure if it's just an issue with my inputs or something more, but I would appreciate any help!

A typical run will finish by saying that MetaBGC Build failed in the output. The script I used and the error and output is below:

run script

singularity exec "$metabgc_image" \
metabgc build \
--prot_alignment "$prot_alignment_dir"/poi_align.fasta \
--prot_family_name Poi \
--cohort_name dataset1 \
--nucl_seq_directory "$nucl_seq_dir_ds1" \
--prot_seq_directory "$prot_seq_directory_ds1" \
--seq_fmt fasta \
--pair_fmt interleaved \
--tp_genes_nucl "$prot_alignment_dir"/poi_homologs_CDS_seqs.fasta \
--f1_thresh 0.5 \
--output_directory "$ds1" \

error output

Translate nucleic acid sequences

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

*** WARNING *** Invalid character '*' in FASTA sequence data, ignored
TP_Homolog 22 seqs, max length 274, avg  length 243
00:00:00     11 MB(1%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00     11 MB(1%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00     16 MB(1%)  Iter   1  100.00%  Align node
00:00:00     16 MB(1%)  Iter   1  100.00%  Root alignment
00:00:00     16 MB(1%)  Iter   2  100.00%  Refine tree
00:00:00     16 MB(1%)  Iter   2  100.00%  Root alignment
00:00:00     16 MB(1%)  Iter   2  100.00%  Root alignment
00:00:00     16 MB(1%)  Iter   3  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter   4  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter   5  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter   6  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter   7  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter   8  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter   9  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter  10  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter  11  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter  12  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter  13  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter  14  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter  15  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter  16  100.00%  Refine biparts
00:00:00     16 MB(1%)  Iter  16  100.00%  Refine biparts
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core `dumped)`
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Error: object 'readID' not found
Run `rlang::last_error()` to see where the error occurred.

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: In addition:
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: There were 14 warnings (use warnings() to see them)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]:

output

Invoking MetaBGC Build...
Number of domains: 22
Alignment length: 282
Running HMM Build on: /path/to/build/directory/poi__30_10__0_30.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__0_30.hmm /path/to/build/directory/poi__30_10__0_30.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__0_30.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__10_40.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__10_40.hmm /path/to/build/directory/poi__30_10__10_40.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__10_40.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__20_50.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__20_50.hmm /path/to/build/directory/poi__30_10__20_50.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__20_50.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__30_60.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__30_60.hmm /path/to/build/directory/poi__30_10__30_60.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__30_60.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__40_70.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__40_70.hmm /path/to/build/directory/poi__30_10__40_70.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__40_70.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__50_80.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__50_80.hmm /path/to/build/directory/poi__30_10__50_80.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__50_80.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__60_90.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__60_90.hmm /path/to/build/directory/poi__30_10__60_90.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__60_90.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__70_100.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__70_100.hmm /path/to/build/directory/poi__30_10__70_100.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__70_100.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__80_110.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__80_110.hmm /path/to/build/directory/poi__30_10__80_110.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__80_110.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__90_120.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__90_120.hmm /path/to/build/directory/poi__30_10__90_120.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__90_120.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__100_130.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__100_130.hmm /path/to/build/directory/poi__30_10__100_130.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__100_130.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__110_140.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__110_140.hmm /path/to/build/directory/poi__30_10__110_140.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__110_140.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__120_150.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__120_150.hmm /path/to/build/directory/poi__30_10__120_150.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__120_150.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__130_160.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__130_160.hmm /path/to/build/directory/poi__30_10__130_160.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__130_160.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__140_170.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__140_170.hmm /path/to/build/directory/poi__30_10__140_170.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__140_170.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__150_180.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__150_180.hmm /path/to/build/directory/poi__30_10__150_180.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__150_180.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__160_190.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__160_190.hmm /path/to/build/directory/poi__30_10__160_190.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__160_190.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__170_200.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__170_200.hmm /path/to/build/directory/poi__30_10__170_200.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__170_200.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__180_210.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__180_210.hmm /path/to/build/directory/poi__30_10__180_210.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__180_210.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__190_220.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__190_220.hmm /path/to/build/directory/poi__30_10__190_220.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__190_220.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__200_230.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__200_230.hmm /path/to/build/directory/poi__30_10__200_230.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__200_230.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__210_240.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__210_240.hmm /path/to/build/directory/poi__30_10__210_240.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__210_240.fas
Running HMM Build on: /path/to/build/directory/poi__30_10__220_250.fas
hmmbuild -n Poi --amino /path/to/build/directory/poi__30_10__220_250.hmm /path/to/build/directory/poi__30_10__220_250.fas
Done Running HMM Build on: /path/to/build/directory/poi__30_10__220_250.fas

R-script path : /usr/local/lib/python3.6/dist-packages/metabgc/src
Metabgc-build has failed. Please check your inputs and contact support on : https://github.com/donia-lab/MetaBGC
abiswas-odu commented 4 years ago

Can you please check if you have records in ${ds1}/build/CombinedHmmSearch.txt? Also, you can take a look at ${ds1}/build/CombinedBLASTSearch.txt and make sure there are some hits in there.

kellystyles commented 4 years ago

Sorry, I accidentally closed the issue (new to github). Both of those files are present, but are empty with the exception of the CombinedBLASTSearch.txt file which has only the column names.

abiswas-odu commented 4 years ago

Could you please also check you have a bunch of .hmm files in the folder spHMMs? If so, i think the hmmer search is failing for some reason. Which tool did you use to generate the protein sequence read files from the nucleotide sequence files?

kellystyles commented 4 years ago

Yes, there are .fas files with the segmented protein alignment, and some .hmm files, although some of these are empty.

I agree that it must be something with hmmer search. Maybe that's where the "Illegal instruction (core dumped)" errors are coming from.

For your second question, are you referring to the synthetic reads? If so, I have not translated these as the readme says they are computed if not provided.

abiswas-odu commented 4 years ago

I think I understand the immediate problem. Please remove the parameter --prot_seq_directory from your command line and resubmit.

We check if this parameter is pointing to a valid directory. If it is, we assume that it contains the protein files already converted. In this case it finds an empty directory and does not do any HMMER or BLAST searching.

If this parameter is not provided it converts the nt sequences into aa sequences and the files are saved in a folder in the build output directory. It takes longer to do the conversion however. So, for bigger runs it might be better to do the conversion in an array job outside.

I will improve the documentation of the parameter with better verbiage.

kellystyles commented 4 years ago

I tried running metabgc-build without the --prot_seq_directory parameter and it still stopped prematurely without translating the synthetic metagenomes.

I then tried working with a subset of my data (2 synthetic metagenomes only) and translated these synthetic reads myself (using transeq to translate into all 6 frames, and appending 'trans' to the end of the resulting file names) and then running metabgc-build. However I'm not sure if the program recognises the files or not as there is no difference in the output, but the job will run until I stop it (the longest I let it run was 4 days). Is there a naming convention required for these translated synthetic metagenomes?

abiswas-odu commented 4 years ago

Please name the protein sequence files exactly the same as the corresponding nucleotide sequence files. Depending on the size of your input files and the number of spHMMs it will take a while to run.

You should see the hmmer search output files in the ${ds1}/build/hmm_result folder being generated. Each of your samples will be hmmer searched against each spHMM. How many spHMMs do you have? Please count the number of .hmm files in ${ds1}/build/spHMMs folder.

Please provide as many cores as you can using the --cpu option for faster searching.

kellystyles commented 4 years ago

Renaming the protein sequence files to the same name as the nucleotide sequences allowed metabgc-build to continue a bit further. However, now it is failing and returning the error output below:

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_character(),
  X4 = col_character(),
  X5 = col_double(),
  X6 = col_character(),
  X7 = col_character()
)

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Parsed with column specification:
cols(
  sseqid = col_character(),
  slen = col_character(),
  sstart = col_character(),
  send = col_character(),
  qseqid = col_character(),
  qlen = col_character(),
  qstart = col_character(),
  qend = col_character(),
  pident = col_character(),
  evalue = col_character(),
  Sample = col_character(),
  sampleType = col_character()
)

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Parsed with column specification:
cols(
  gene_name = col_character(),
  start = col_double(),
  end = col_double(),
  interval = col_character(),
  prot_type = col_character()
)

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Error: Can't subset columns that don't exist.
✖ Column `model_cov` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: In addition:
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: There were 12 warnings (use warnings() to see them)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]:

It seems to suggest an issue with the EvaluateSpHMMs.R script, with model_cov. I'm not familiar with R though, so it could just be that there's no output written for that column perhaps? How crucial is it to have R v3.6.1? I am only able to install R v3.6.3

My protein alignment is split into 23 .hmm files. All the HMM search results seem to be in 'CombinedHmmSearch.txt'. However, there is still no BLAST output in the 'blastn_result' directory.

abiswas-odu commented 3 years ago

We have released 2.0.0 version the tool. All R dependencies have been removed making build easier to run. A toy build example can be downloaded and tested here: https://github.com/donia-lab/MetaBGC#quick-start