DavidBSauer / OGT_prediction

Scripts for calculating features and regression of prokaryote OGT
GNU General Public License v3.0

no OGT predicted for part of the data #2

Closed dspeth closed 4 years ago

dspeth commented 4 years ago

Hi David,

I've managed to run OGT prediction on my own metagenome-assembled genomes (MAGs), using the models present in the data/prediction_demo/regression_models/ directory. The results seem to make sense given my prior knowledge of the organisms in the sample. This is awesome, but I'm running into two issues:

1) Two of my genomes go missing early on in the analysis. I've checked the genomes_retrieved.txt and species_taxonomic.txt files and the genomes/ directory, and all contain 331 genomes, but the predictor only picks up 329. Any idea what's going on there?

2) Prediction is dependent on the presence of 16S rRNA genes, which are often absent from MAGs. In my case, only 112 out of 329 recognized genomes contain a 16S rRNA gene. I imagine genomes from uncultivated species are a major target for OGT prediction, so it would be great if there was a way to make the presence of 16S rRNA optional, or possibly remove it from the prediction model. I'll have a look at the scripts to see if I can find an easy way to do so, but advice would be appreciated.

Thanks! Daan

DavidBSauer commented 4 years ago

Hello Daan, Thank you for the feedback. It is good to hear my tool is proving useful.

Regarding your questions:

  1. Have you checked the prediction.log file? Does it say anything about the genomes in question? If a genome cannot be predicted for some reason, the reason should be logged.

  2. This one is easy. I am recalculating models without features from tRNA, rRNA, or both. As these are for MAGs, I did this for both the models with and without genome size. I'll have those updated shortly. You'll have to choose which model you think is best for your data, but you will be able to predict OGT for those genomes without identified RNAs. There may, though, be some loss in accuracy due to fewer features being used.

David
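To decide which genomes would need the models without rRNA features, it can help to check which ones have a 16S hit at all. The following is a minimal sketch, assuming barrnap's standard GFF3 output, where 16S hits carry a `Name=16S_rRNA` attribute. The function name `count_16s` is hypothetical and is not part of the OGT_prediction scripts, and how the pipeline itself stores barrnap results is not covered here.

```python
def count_16s(gff_text):
    """Count 16S rRNA features in barrnap GFF3 output given as a string.

    Assumes the standard nine tab-separated GFF3 columns, with 16S hits
    labelled Name=16S_rRNA in the attributes column.
    """
    count = 0
    for line in gff_text.splitlines():
        # Skip blank lines and GFF comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        # Parse the key=value pairs of the attributes column.
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if attributes.get("Name") == "16S_rRNA":
            count += 1
    return count
```

A genome for which `count_16s` returns 0 would be a candidate for one of the models recalculated without rRNA features.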

dspeth commented 4 years ago

ad 2. Awesome! It looks like the lack of prediction in almost all MAGs is solely due to the lack of rRNA genes in the MAGs. Good point about considering the genome size, I had not thought of that complication yet. Thanks

ad 1. Here's the head of the prediction.log file. After checking for dependencies and loading the models, it indicates 329 species have been detected. I'm sure there's some error in one of the files I feed to the script, but I haven't been able to find it. Is the genomes_retrieved.txt file the source of the numbers at that point in the log file?

tRNAscan-SE version info: tRNAscan-SE 2.0.5 (October 2019)
bedtools version info: bedtools v2.29.2
barrnap version info: barrnap 0.9
prodigal version info: Prodigal V2.6.3: February, 2016
Numpy version: 1.18.1
Biopython version: 1.76
Python version: 3.8.1 | packaged by conda-forge | (default, Jan 29 2020, 14:55:04) [GCC 7.3.0]
Platform: Linux-2.6.32-754.10.1.el6.x86_64-x86_64-with-glibc2.10
the directory of OGT regression models: ../data/prediction_demo/regression_models/
the file of genomic sequences to predict: genomes_retrieved.txt
file of taxonomy for each genome: species_taxonomic.txt
only using regression models with an R2 >= 0.5
reading in the class-Actinobacteria linear regression model file: ../data/prediction_demo/regression_models/class-Actinobacteria-all_features.txt
reading in the class-Alphaproteobacteria linear regression model file: ../data/prediction_demo/regression_models/class-Alphaproteobacteria-all_features.txt
reading in the class-Bacilli linear regression model file: ../data/prediction_demo/regression_models/class-Bacilli-all_features.txt
reading in the class-Betaproteobacteria linear regression model file: ../data/prediction_demo/regression_models/class-Betaproteobacteria-all_features.txt
reading in the class-Clostridia linear regression model file: ../data/prediction_demo/regression_models/class-Clostridia-all_features.txt
reading in the class-Gammaproteobacteria linear regression model file: ../data/prediction_demo/regression_models/class-Gammaproteobacteria-all_features.txt
reading in the family-Bacillaceae linear regression model file: ../data/prediction_demo/regression_models/family-Bacillaceae-all_features.txt
reading in the family-Lactobacillaceae linear regression model file: ../data/prediction_demo/regression_models/family-Lactobacillaceae-all_features.txt
reading in the order-Bacillales linear regression model file: ../data/prediction_demo/regression_models/order-Bacillales-all_features.txt
reading in the order-Burkholderiales linear regression model file: ../data/prediction_demo/regression_models/order-Burkholderiales-all_features.txt
reading in the order-Clostridiales linear regression model file: ../data/prediction_demo/regression_models/order-Clostridiales-all_features.txt
reading in the order-Corynebacteriales linear regression model file: ../data/prediction_demo/regression_models/order-Corynebacteriales-all_features.txt
reading in the order-Enterobacterales linear regression model file: ../data/prediction_demo/regression_models/order-Enterobacterales-all_features.txt
reading in the order-Lactobacillales linear regression model file: ../data/prediction_demo/regression_models/order-Lactobacillales-all_features.txt
reading in the order-Pseudomonadales linear regression model file: ../data/prediction_demo/regression_models/order-Pseudomonadales-all_features.txt
reading in the phylum-Actinobacteria linear regression model file: ../data/prediction_demo/regression_models/phylum-Actinobacteria-all_features.txt
reading in the phylum-Bacteroidetes linear regression model file: ../data/prediction_demo/regression_models/phylum-Bacteroidetes-all_features.txt
reading in the phylum-Euryarchaeota linear regression model file: ../data/prediction_demo/regression_models/phylum-Euryarchaeota-all_features.txt
reading in the phylum-Firmicutes linear regression model file: ../data/prediction_demo/regression_models/phylum-Firmicutes-all_features.txt
reading in the phylum-Proteobacteria linear regression model file: ../data/prediction_demo/regression_models/phylum-Proteobacteria-all_features.txt
reading in the superkingdom-Archaea linear regression model file: ../data/prediction_demo/regression_models/superkingdom-Archaea-all_features.txt
reading in the superkingdom-Bacteria linear regression model file: ../data/prediction_demo/regression_models/superkingdom-Bacteria-all_features.txt
initial number of species to be predicted: 329
initial number of genomes to be analyzed: 329
number of species to be predicted: 329
number of genomes to be analyzed: 329

DavidBSauer commented 4 years ago

Based on the log file, two of the genomes are either absent from the input file or duplicated in it (same genome name). This could arise from 1) duplicate entries in the input file, or 2) two genomes from different species sharing the same file name.
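Both causes can be checked outside the predictor. The sketch below assumes that genomes_retrieved.txt is tab-separated with the genome file name in the first column, and that the genome files sit directly in the genomes/ directory. Neither assumption is confirmed in this thread, and the helper names are hypothetical rather than taken from the OGT_prediction scripts.

```python
from collections import Counter
from pathlib import Path


def read_first_column(path):
    """Return the first tab-separated field of every non-empty line."""
    names = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.strip():
                names.append(line.split("\t")[0])
    return names


def find_duplicates(names):
    """Return the names that occur more than once, sorted."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


def find_missing(names, genome_dir):
    """Return the listed names that have no matching file in genome_dir."""
    directory = Path(genome_dir)
    present = {p.name for p in directory.iterdir()} if directory.is_dir() else set()
    return sorted(set(names) - present)


def report(list_file, genome_dir):
    """Print a short summary of duplicated and missing genome names."""
    names = read_first_column(list_file)
    print(f"{len(names)} entries, {len(set(names))} unique names")
    print("duplicated names:", find_duplicates(names))
    print("listed but not found on disk:", find_missing(names, genome_dir))
```

Calling `report("genomes_retrieved.txt", "genomes")` from the working directory would show whether 331 entries collapse to 329 unique names, which would point to duplicated file names as the cause.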

I will tweak the code to change how the input file is parsed (#2) and logged (#1).

DavidBSauer commented 4 years ago

New regression models are up. I am going to close this issue and create a new one for the genome name parsing/logging to keep the two issues separate.

MdUmar-tech commented 3 years ago

Hello sir, I have a novel genus that I am unable to analyze. The problem is downloading the data. Without the first two commands (genome download and classify), how do I proceed with the OGT calculation? If I download the genome manually, how can I proceed? Thanks