Simon A. Babayan, Richard J. Orton and Daniel G. Streicker
A series of scripts and datasets described in Babayan et al. (2018) Science doi: 10.1126/science.aap9072 which predict the reservoir hosts, existence of arthropod vectors and identity of arthropod vectors using gradient boosting machines.
_BabayanEtAlsequences.fasta contains coding sequences for all viruses used in the analyses
EbolaTimeSeriesData.csv contains epidemiological data and genomic features for Zaire ebolaviruses sampled during the 2014-2016 West African outbreak
_BabayanEtAlVirusData.csv contains reservoir host, arthropod-borne transmission status and vector taxa for all ssRNA viruses analyzed and features extracted from the genome of each virus
_arthropodBornefeatureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting arthropod-borne transmission across different training sets
_arthropodBornePN+selGen.R Uses gradient boosting machines in h2o to train and generate study wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods and genomic features selected by _arthropodBornefeatureSelection.R
_arthropodBornePN.R Uses gradient boosting machines in h2o to train and generate study wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods
_arthropodBorneselGen.R Uses gradient boosting machines in h2o to train and generate study wide (bagged) predictions of the arthropod-borne transmission status of each virus using genomic features selected by _arthropodBornefeatureSelection.R
_reservoirfeatureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting reservoir hosts across different training sets
_reservoirPredictPN+selGen.R Uses gradient boosting machines in h2o to train and generate study wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods and genomic features selected by _reservoirfeatureSelection.R
_reservoirPredictPN.R Uses gradient boosting machines in h2o to train and generate study wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods
_reservoirPredictselGen.R Uses gradient boosting machines in h2o to train and generate study wide (bagged) predictions of the reservoir host of each virus using genomic features selected by _reservoirfeatureSelection.R
_vectorPredictfeatureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting reservoir hosts across different training sets
_vectorPredictPN+selGen.R Uses gradient boosting machines in h2o to train and generate study wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods and genomic features selected by _vectorPredictfeatureSelection.R
_vectorPredictPN.R Uses gradient boosting machines in h2o to train and generate study wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods
_vectorPredictselGen.R Uses gradient boosting machines in h2o to train and generate study wide (bagged) predictions of the vector of each virus using genomic features selected by _vectorPredictfeatureSelection.R
_algocomparison.py Compares the predictive power of a variety of competing machine learning algorithms to predict reservoir hosts, arthropod-borne transmission and vector taxa from all possible genomic features