Reproducing LanC spHMMs.

Manoo-hao commented 3 years ago

I'm putting together this issue as a follow up on email correspondence with @frcamacho. I have been trying to produce a LanC model (and others) using metabgc-build and have had variable success with it. I chose LanC to begin with as there is a model available from the 2019 publication, so there is a basis for comparison. I used a singularity container, containing metabgc V2 as of the commit on April 8th, 2021 (https://github.com/donia-lab/MetaBGC/commit/886f55fcfa78da80c0d00d4ba2c8e326549b7d76). (Slurm script attached below)

In this process I came across one major discrepancy right at the start of the build process, i.e. the createsphmms.py script produces a vastly different range of intervals (albeit 43 in both cases), using the same input sequences (see attached spHMMs directory):

TP nucleotide sequences: https://github.com/donia-lab/MetaBGC/blob/master/MetaBGC-V1/MetaBGC-V1_Benchmark_Data/LanC_like/LanC_like-True-Positive-genes.fasta
46 LanC-like protein sequences as from the pfam seed alignment https://github.com/donia-lab/MetaBGC/blob/master/MetaBGC-V1/MetaBGC-V1_Benchmark_Data/LanC_like/LanC_like.fasta

When producing a hmm-consensus sequence (hmmemit -c) from the spHMMs you originally provided (https://github.com/donia-lab/MetaBGC/tree/master/MetaBGC-V1/MetaBGC-V1_Benchmark_Data/LanC_like/spHMMs), I get a continuously overlapping consensus sequence from fragments of 18-30 residues. However, when doing the same with my spHMMs, I end up with multiple gaps and a fragment size range of 7-28 residues.

With the createsphmms script being at the start of the build process, I imagine that this discrepancy also has knock-on effects on the rest of the build process. E.g., at the end of the build process, there are 23 intervals, still ranging from 0-30 to 420-450, in your F1 plot (https://github.com/donia-lab/MetaBGC/blob/master/MetaBGC-V1/MetaBGC-Build_Outputs/LanC_like/plot.eps), whereas I end up with 29 intervals ranging between 0-30 - 860-890, still with major gaps in between segments (see image below + attachment for full HiPer_spHMMs directory). I did notice that the createsphmms script has changed since V2 was released, but the effect is so dramatic that I couldn't help but notice...

lanc_family-F1_Plot

There is of course a bit more randomness involved with the rest of the build process depending on how ART subsamples reads from genomes, and the genomes used as inputs in the first place, e.g. for this example, I have used only 64 instead of 140 synthetic metagenomes (see attached genomes & coverages). However, as far as I understand the build module, the input genomes and their respective coverages don't have an effect on how the alignment and the intervals are produced.

Interestingly, though, no matter with how many metagenomes as inputs I have worked, using an F1-threshold of 0.5 I only ever received one High-Performing interval, which is in my case 730-760. If I produce a consensus sequence from this hmm (hmmemit -c), it significantly overlaps with your interval 350-380 (image below).

Encouragingly, even though I only produced the one HiPer_spHMM, it is consistent with one of the ones you produced. However, the output is so dramatically different, that am wondering if there is more to it than just me overlooking something small. Is there possibly a bigger issue that needs addressing? I would be grateful if you could give me pointers as to how I could reproduce your published spHMMs, and by extension of course also the HiPer_spHMMs at the end of the build process. If you need any further information, please don't hesitate to get in touch.

Attachments:

slurm script for running build: sing_metabgc_build_LanC.txt
spHMMs directory: spHMMs.tar.gz
coverage tables: genome_coverage_tables.tar.gz
HiPer_spHMMs directory: 64mg_sing_HiPer_spHMMs.tar.gz

abiswas-odu commented 3 years ago

I will have to look into this. We may have adjusted the LanC intervals that were produced because the gap counting was not correct or the new code does something slightly different. We tested v2.0 with the 4 cyclase and got the same intervals.

Manoo-hao commented 3 years ago

Thanks for looking into it. I have otherwise only tested V2 with OxyN and also get the same results.

frcamacho commented 3 years ago

@abiswas-odu any updates on this issue?

abiswas-odu commented 3 years ago

I have not got a chance to look at it yet. I will try to get to it next week.

abiswas-odu commented 3 years ago

Thanks for looking into it. I have otherwise only tested V2 with OxyN and also get the same results.

@Manoo-hao Do you mean that you have the sample problem with OxyN as you have with LanC? Or does OxyN work and LanC does not?

Manoo-hao commented 3 years ago

Using the provided tp_genes and alignment files as input:

And running metabgc build from the singularity container as described in the first post.

I get the output attached in the archive below. The archive contains only the first files created by running build, i.e. the spHMMs directory, the translated tp_genes file, alignment, and Gene_Interval_Pos files.

intervals.tar.gz

I have in the meantime deleted the full build output for space reasons, but can re-synthesise metagenomes and re-build from scratch if the complete output would help.

Unrelated to LanC or OxyN, but related to the same issue, I have in the meantime had an instance where, when running build, no spHMMs were generated. Looking at the input sequences, I found a few sequences in the respective protein alignment file (a pfam seed alignment) that were significantly shorter than the majority of the sequences, and as a result, had long stretches of indels at the termini. Removing these sequences and re-aligning solved the problem of no spHMMs being generated.

abiswas-odu commented 3 years ago

The spHMMS for OxyN are the same we generated for the publication. You don't have the high performing spHMMs which we select out of these. They are the in the HiPer_spHMMs folder. But the initial list is the same. For LanC I think we manually adjusted the cutoffs for better selection and gap avoidance. I have to look into the original R scripts @frcamacho developed to perform the LanC model build.

donia-lab / MetaBGC

Reproducing LanC spHMMs. #22