Kalan-Lab / lsaBGC-Pan

lsaBGC - suite for pan-BGC-omics analysis
BSD 3-Clause "New" or "Revised" License

Queries #5

Closed MHassanSaeed closed 2 months ago

MHassanSaeed commented 2 months ago

Greetings, Thanks for the amazing tool.

I ran conda install bioconda::lsabgc to install the package. Later I set up the databases using setup_annotation_dbs.py and ran the test commands, at which point it said "unrecognized arguments: -rsh". This flag appears in run_test.sh in the command lsaBGC-Pan -g input_genomes/ -o lsaBGC-Pan_Results/ -c 4 -nb -rsh.

Before running the test and the database setup, I ran it on my own data, and it ran well up to the break; after setting the parameters and installing the databases, I ran the subsequent commands.

  1. I hope this way of installation won't affect the results.

  2. All results were produced except for the lsaBGC-Reconcile plots; no plots or scripts were found besides two colouring PDFs.

  3. If genomes are provided as raw input data, does it only produce GECCO results? Is it possible to have antiSMASH results using lsaBGC-Pan?

  4. Is the Jaccard Index cutoff of 0 and an MCL inflation of 5 fine here? (Image attached)

raufs commented 2 months ago

Thank you for using lsaBGC-Pan and the report!

1. I hope this way of installation won't affect the results.

It sounds like your installation is properly set up with v1.0.6. Sorry, I am currently working on updates for the next version - v1.0.7 - and this -rsh argument is being introduced in it. To run the tests with v1.0.6, you can just remove the -rsh flag from the bash script and it should work. Thank you for mentioning this; I might go back to not having this option in the test command to make sure it works regardless of the version of lsaBGC-Pan.
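As a small illustration of the workaround above, the sketch below removes a flag from a command line such as the one in run_test.sh. The helper name drop_flag is made up for this example, and it assumes that -rsh is a plain switch that takes no value (it is described as a flag and appears last in the test command); editing the script by hand works just as well.

```python
import shlex


def drop_flag(command, flag):
    """Return the command line with every occurrence of `flag` removed.

    Assumes `flag` is a switch without a value; a flag that takes an
    argument would need its value removed as well.
    """
    tokens = shlex.split(command)
    return shlex.join(token for token in tokens if token != flag)


# Test command quoted from run_test.sh in the question above.
test_command = "lsaBGC-Pan -g input_genomes/ -o lsaBGC-Pan_Results/ -c 4 -nb -rsh"
print(drop_flag(test_command, "-rsh"))
# prints: lsaBGC-Pan -g input_genomes/ -o lsaBGC-Pan_Results/ -c 4 -nb
```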

2. All results were produced except for the lsaBGC-Reconcile plots; no plots or scripts were found besides two colouring PDFs.

Great, and thank you a ton for the report here. Indeed, I broke something in lsaBGC-Reconcile during the v1.0.6 updates (it worked in v1.0.5), so this should also be patched in v1.0.7, which you can expect to be available via bioconda later today.

3. If genomes are provided as raw input data, does it only produce GECCO results? Is it possible to have antiSMASH results using lsaBGC-Pan?

Yes, you can run a joint analysis with antiSMASH and GECCO, but you need to provide a directory with precomputed antiSMASH results for each sample. Then you would just use the -rg flag to specify that you also want GECCO BGC predictions. Unfortunately, because both lsaBGC-Pan and antiSMASH have many dependencies, it is difficult to put them in the same conda package and have everything run automatically from raw genomes.

4. Is the Jaccard Index cutoff of 0 and an MCL inflation of 5 fine here? (Image attached)

Sure, I think it might make sense. It is generally more informative to look at the plots for the specific parameter combinations. A Jaccard index cutoff of 0 means there is no strict requirement that a pair of BGCs share any overlap before MCL clustering groups them into GCFs, so it might be good to increase it to 20. But if you look at the JIC 0, MCL 5 combination plot that looks something like this:

[image: parameter combination plot for JIC 0, MCL 5]

And you see that the upper left barplot only has the categories 'core exists' or 'not relevant', this indicates that for non-singleton GCFs (GCFs with two or more BGC instances) a core set of orthogroups is being found. The two larger barplots below are also informative: they show whether you are getting potentially paralogous instances in a GCF (the grey/black plot) and whether the annotation types of BGCs (e.g. polyketide, NRP) are the same within GCFs. Note that if you are working with draft-quality assemblies, "paralogous" instances could also just be the same BGC split because of an imperfect assembly; I will note this better in the documentation.
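To make the effect of the Jaccard index cutoff concrete, here is a minimal sketch that treats each BGC as a set of orthogroup identifiers and lists the pairs that would be allowed into MCL clustering at a given cutoff. It is not the lsaBGC-Cluster implementation: the function names, the toy BGCs, the percentage scale (0 to 100) and the ">=" comparison are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard_index(a, b):
    """Jaccard index of two orthogroup sets, expressed as a percentage."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


def pairs_passing_cutoff(bgcs, cutoff):
    """Return the BGC name pairs whose Jaccard index is >= cutoff (assumed rule)."""
    passing = []
    for name_a, name_b in combinations(sorted(bgcs), 2):
        if jaccard_index(bgcs[name_a], bgcs[name_b]) >= cutoff:
            passing.append((name_a, name_b))
    return passing


# Hypothetical BGCs described by the orthogroups they contain.
bgcs = {
    "BGC_A": {"OG1", "OG2", "OG3", "OG4"},
    "BGC_B": {"OG1", "OG2", "OG3", "OG5"},
    "BGC_C": {"OG4", "OG6", "OG7", "OG8", "OG9"},
    "BGC_D": {"OG10", "OG11"},
}

print(len(pairs_passing_cutoff(bgcs, 0)))  # prints: 6 (every pair, even with no overlap)
print(pairs_passing_cutoff(bgcs, 20))      # prints: [('BGC_A', 'BGC_B')]
```

With a cutoff of 0, pairs that share no orthogroups at all are still passed on to clustering, which is why raising the cutoff can be worth trying.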

Thanks again and will comment later today once v1.0.7 is up on bioconda - just performing some final minor adjustments and tests!

Kind regards, Rauf

raufs commented 2 months ago

Hi, to follow up, new releases of lsaBGC on bioconda are now available - the current latest is v1.0.9 - which introduce visuals for lsaBGC-Sociate and patch the issue with lsaBGC-Reconcile.

There might be one additional new release today or sometime this week as I am trying to adjust threading related matters that are more of an issue if you are running larger analyses on servers, but I think this should be the last release in a while.

raufs commented 2 months ago

Thank you again for the feedback and great questions! Closing the issue, but please just re-open it or open a new one if you have additional questions or if not everything was properly addressed!

MHassanSaeed commented 2 months ago

Thank you very much for your detailed response. I suggest adding more information about the Jaccard index and MCL inflation so users can get an idea of which parameters are best to use (for instance, I don't know much about them myself, which is why I asked whether I was using the correct parameters).

raufs commented 2 months ago

Hi, thank you again for the question and for using the software. I thought I had linked the documentation for lsaBGC-Cluster to what was on the original lsaBGC wiki page: https://github.com/Kalan-Lab/lsaBGC/wiki/05.-Clustering-BGCs-into-GCFs - but it doesn't look like I ever did, so thank you for bringing this up! The core of lsaBGC-Cluster works mostly the same as before and the report is similar too, but a couple of plot types have been removed or replaced. The one major difference in the "new" lsaBGC-Cluster is that it is smarter about handling BGC instances that are near scaffold/contig edges. It now automatically considers a pair of BGC instances as potentially belonging to the same GCF prior to MCL clustering if, for a BGC near a scaffold/contig edge, 70% of its ortholog groups are found in the other BGC instance. This parameter can be controlled via the -cc option.
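The edge rule described above can be sketched as a containment check. This is only an illustration of the idea, not the lsaBGC-Cluster code: the function names, the toy orthogroup sets, the ">=" comparison and expressing the cutoff as a fraction are assumptions (how the -cc value is actually formatted is not stated here).

```python
def containment(edge_bgc, other_bgc):
    """Fraction of the edge BGC's ortholog groups that also occur in the other BGC."""
    edge_bgc, other_bgc = set(edge_bgc), set(other_bgc)
    if not edge_bgc:
        return 0.0
    return len(edge_bgc & other_bgc) / len(edge_bgc)


def is_candidate_pair(bgc_a, bgc_b, a_near_edge, b_near_edge, cutoff=0.70):
    """True if either BGC is near a scaffold/contig edge and is mostly
    contained in the other one (assumed reading of the rule)."""
    if a_near_edge and containment(bgc_a, bgc_b) >= cutoff:
        return True
    if b_near_edge and containment(bgc_b, bgc_a) >= cutoff:
        return True
    return False


# Hypothetical example: a 3-orthogroup fragment at a contig edge versus a
# complete 10-orthogroup BGC. Their Jaccard index is only 30%, but the
# fragment is fully contained in the complete instance.
fragment = {"OG1", "OG2", "OG3"}
complete = {f"OG{i}" for i in range(1, 11)}

print(is_candidate_pair(fragment, complete, a_near_edge=True, b_near_edge=False))   # prints: True
print(is_candidate_pair(fragment, complete, a_near_edge=False, b_near_edge=False))  # prints: False
```

Measuring overlap relative to the fragmented BGC rather than to the union is what lets a short, truncated instance still pair with its complete counterpart.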

I am planning on updating the documentation later this week. The updates will mainly address how to interpret lsaBGC-Sociate results, which is where a lot of the recent updates have been focused. But I think copying over and adapting the Wiki for lsaBGC-Cluster also makes sense. I might also release v1.1.2 later this week - there should be no functional difference from v1.1.1 - just improved code documentation on the back-end.

raufs commented 2 months ago

Documentation now updated - with additional info on lsaBGC-Cluster on page 7 of the wiki. Thank you a ton for bringing this up - I have a more concise explanation of the algorithm there and made other updates to the original documentation.

Closing this again, but we really appreciate the feedback! Please re-open or open a new ticket if something else comes up!

MHassanSaeed commented 1 month ago

Thanks for the update. I was wondering, have you ever tried extracting RiPPs and NRPs experimentally in the lab? Any insights would be helpful.

raufs commented 1 month ago

I personally do not have much insight to offer there beyond the simple experiments we performed for the original lsaBGC study. I focus on bioinformatics, but am in a lab where others investigate such secondary metabolites. They have had good experiences using tools from the Dorrestein and van der Hooft labs for connecting genomics and metabolomics. I think their labs also participate in or organize many workshops that might be of interest to you.

MHassanSaeed commented 1 month ago

Okay, that makes sense, thanks for responding.

I believe this tool is useful for both metagenomes and whole genomes?

Secondly, does the output folder have to be named "lsaBGC-Pan_Results"? Otherwise, it throws an error: No such file or directory: '/usr/F_lsaBGC-Pan_Results/Gene_Calling/genome_index.1.gbk'

raufs commented 1 month ago

No problem! The original lsaBGC had some metagenomic functionalities, but those have been removed from lsaBGC-Pan. You can, however, use lsaBGC-Pan with metagenome-assembled genomes (MAGs), and compared to the original lsaBGC its final reports give improved consideration to whether BGCs might be fragmented.

I believe there might be another problem with the run. Can you share additional details like the command you ran or the particular input genome file related to the error?

MHassanSaeed commented 1 month ago

I am asking about genomes (I am analyzing whole genome sequences, not metagenome-assembled genomes (MAGs)). I guess lsaBGC-Pan runs fine on those genomes?

This works fine:

    lsaBGC-Pan -g fna-faecalis/ -o lsaBGC-Pan_Results -c 28

This produces an error:

    lsaBGC-Pan -g fna-faecalis/ -o faecalis_lsaBGC-Pan_Results -c 28

    Step 1: Beginning by assessing input files

    Traceback (most recent call last):
      File "/usr/conda/envs/lsabgc/bin/lsaBGC-Pan", line 1094, in <module>
        lsaBGC()
      File "/usr/conda/envs/lsabgc/bin/lsaBGC-Pan", line 687, in lsaBGC
        with open(genome_gbk) as ogg:
    FileNotFoundError: [Errno 2] No such file or directory: '/usr/lsaBGC-Pan_Results/Gene_Calling/identities.gbk'

raufs commented 1 month ago

Hi, I think the issue might be related to new files that have appeared in your input genomes directory between the two runs. Can you run ls -lhta on the fna-faecalis/ folder?

The output directory should not need to be named something specific.

MHassanSaeed commented 1 month ago

Yes, the genomes directory does have text and CSV files and other FASTA files.

Could you please confirm whether this tool can be used for whole genomes?

raufs commented 1 month ago

Great, so you would just need to move those non-genome files somewhere else.

Yes, the program is intended for isolate genomes, the higher quality the better! I was just saying that it should also work with MAGs or lower/draft-quality genome assemblies. This is because it tracks which BGCs are near scaffold edges, and might thus be fragmented, and the program has some handling of such cases.
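For the suggestion above of moving non-genome files out of the input directory, a minimal sketch could look like the following. The set of suffixes treated as genome files is an assumption for illustration, not a statement of what lsaBGC-Pan accepts, and the function name and directory names are hypothetical.

```python
from pathlib import Path
import shutil

# Assumed genome FASTA suffixes; adjust to whatever your genome files use.
GENOME_SUFFIXES = {".fna", ".fa", ".fasta"}


def move_non_genome_files(genome_dir, holding_dir, suffixes=GENOME_SUFFIXES):
    """Move files without a genome-like suffix out of genome_dir.

    Returns the names of the files that were moved. A trailing ".gz" is
    ignored when checking the suffix, so "x.fna.gz" counts as a genome file.
    """
    genome_dir, holding_dir = Path(genome_dir), Path(holding_dir)
    holding_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(genome_dir.iterdir()):
        if not path.is_file():
            continue
        name = path.name.lower()
        if name.endswith(".gz"):
            name = name[:-3]
        if Path(name).suffix not in suffixes:
            shutil.move(str(path), str(holding_dir / path.name))
            moved.append(path.name)
    return moved


# Example call (directory names are placeholders):
# move_non_genome_files("fna-faecalis", "fna-faecalis_other_files")
```

A suffix check like this catches text and CSV files, but it cannot tell a genome assembly from an unrelated FASTA file; those still need to be moved by hand.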