R script that subsets tables to only contain representative genomes

erikrikarddaniel / pf-gtdb-analyses

Analysis tools for Pfitmap/RNRdb/GTDB

MIT License

R script that subsets tables to only contain representative genomes #1

Closed: erikrikarddaniel closed this issue 4 years ago

erikrikarddaniel commented 4 years ago

Add an R script (i.e. not an Rmd) to the scripts directory that reads all the feather files and gtdb_metadata.tsv, and writes new feather files that only contain data for the species representative genomes.

GhadaNOUAIRIA commented 4 years ago

The list of representatives is in a file called sp_clusters.tsv (not gtdb_metadata.tsv). My script reads the representatives directly from sp_clusters.tsv and does not use gtdb_metadata.tsv.
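For readers following along, the subsetting idea can be sketched roughly as below. This is not the script from this repo (which is written in R and works on feather files); it is a minimal Python illustration using only the standard library and in-memory rows. The column name "Representative genome" and the key column `genome_accno` are assumptions made for the example, as are the toy accessions.

```python
import csv
import io


def read_representatives(tsv_text, column="Representative genome"):
    """Collect the representative genome accessions from sp_clusters-style TSV text.

    The column name is an assumption for this sketch.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader}


def keep_representatives(rows, representatives, key="genome_accno"):
    """Keep only rows whose genome accession is a species representative.

    `key` is a hypothetical column name; the real tables may use another.
    """
    return [row for row in rows if row[key] in representatives]


# Toy stand-in for sp_clusters.tsv (two species clusters).
sp_clusters = (
    "Representative genome\tGTDB species\n"
    "G1\ts__A\n"
    "G3\ts__B\n"
)

# Toy stand-in for one of the tables; G2 is not a representative.
table = [
    {"genome_accno": "G1", "value": 10},
    {"genome_accno": "G2", "value": 20},
    {"genome_accno": "G3", "value": 30},
]

reps = read_representatives(sp_clusters)
subset = keep_representatives(table, reps)
```

Tables that do not carry a genome accession themselves would first need to be joined to one that does; that step is left out here.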

erikrikarddaniel commented 4 years ago

OK. Then we need to add a download of this file to data/Makefile.

GhadaNOUAIRIA commented 4 years ago

We also need to download it and publish it to results. Maybe we can add this file to the GetMetadata process in main.nf? Or should it have its own process?

erikrikarddaniel commented 4 years ago

Since this is not part of the workflow, it only needs to be downloaded by the build process in this repo. (Another question is whether it would be better as part of the workflow, but let's leave it as it is for now.)

GhadaNOUAIRIA commented 4 years ago

Added to the Makefile.