KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io

:bug: Fix GTDB database setup #329

Closed. Sidduppal closed this 10 months ago.

Sidduppal commented 1 year ago

- :bug: Fix bug with running the GTDB taxonomic workflow
- :bug: Fix bug with setting up the GTDB database (issue #328)

chasemc commented 1 year ago

Why do the files need to be decompressed?

Sidduppal commented 1 year ago

Decompression is required to modify the fasta headers and then concatenate the sequences together for the final database.
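
A minimal sketch of the kind of decompress-and-rewrite step being described, assuming hypothetical paths, a hypothetical `_protein.faa.gz` naming convention, and a made-up header format (a sketch under those assumptions, not the actual `gtdb.py` implementation):

```python
# Sketch only: decompress each per-genome protein FASTA to disk, rewrite its
# headers to carry the genome accession, and concatenate everything into a
# single database FASTA. Paths, naming, and header format are assumptions.
import glob
import gzip
import os
import shutil


def build_combined_fasta(faa_gz_dir: str, combined_fasta: str) -> None:
    with open(combined_fasta, "w") as out:
        for faa_gz in glob.glob(os.path.join(faa_gz_dir, "*_protein.faa.gz")):
            plain_faa = faa_gz[:-3]
            # Decompress to a plain-text copy so the headers can be edited.
            with gzip.open(faa_gz, "rb") as src, open(plain_faa, "wb") as dst:
                shutil.copyfileobj(src, dst)
            accession = os.path.basename(plain_faa).replace("_protein.faa", "")
            with open(plain_faa) as fh:
                for line in fh:
                    if line.startswith(">"):
                        orf_id = line[1:].split()[0]
                        # Tag each ORF with its genome accession in the combined header.
                        out.write(f">{accession}_{orf_id}\n")
                    else:
                        out.write(line)
```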

chasemc commented 1 year ago

Is that the grep in the .sh files that was also modified?

Sidduppal commented 1 year ago

Yes, that's an independent bug fix that I found during some testing. It adds an underscore after the ORF ID, preventing any partial matches.
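
For illustration, a tiny example of the partial-match problem the trailing underscore avoids (the IDs below are made up, not taken from the GTDB files):

```python
# Made-up lines: searching for "orf_1" also matches "orf_10" and "orf_12",
# while searching for "orf_1_" (ORF ID plus a trailing underscore) matches
# only the intended ORF, provided the ID is followed by an underscore in the data.
lines = ["orf_1_start_100", "orf_10_start_200", "orf_12_start_300"]

print([line for line in lines if "orf_1" in line])   # all three lines match
print([line for line in lines if "orf_1_" in line])  # only 'orf_1_start_100'
```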

chasemc commented 1 year ago

Possible to use zgrep on still gzip'ed files instead?

jason-c-kwan commented 1 year ago

Or pipe the output from zcat, or use the gzip module in Python

Sidduppal commented 1 year ago

@chasemc

> Possible to use zgrep on still gzip'ed files instead?

We are modifying the files to change the FASTA headers. zgrep would only retrieve the headers; modifying them would be hard if the files are not unzipped. It would also require using external subprocesses rather than internal Python modules.

@jason-c-kwan

> Or pipe the output from zcat, or use the gzip module in Python

I am currently using the gzip module in Python for the file manipulation. A possible approach would be to pipe the output of zcat, modify the headers, and then recompress it, but I'm not sure how efficient that would be compared to unzipping the files, which takes around 5-10 minutes.

chasemc commented 1 year ago

My confusion and the suggestion of zgrep came about because I thought the edits were related (in the future, try to fix only a single thing in a PR, or at least separate out the commits).

My question is then the same as Jason's: is there a reason not to just read the files using the gzip module rather than decompress, write, and then read back in?

Is the following code's single purpose to read some fasta files, edit the identifier and then concatenate into a single file? https://github.com/KwanLab/Autometa/blob/255066a2cdd9ed9371a2b68a344a269adee56554/autometa/taxonomy/gtdb.py#L57C2-L103

" single purpose" meaning no other code relies on any of the extracted files

jason-c-kwan commented 1 year ago

Yeah seems like there are too many steps.

  1. Get the protein accession from the filepath (can be done on the gz)
  2. Open the combined gz file for writing with the gzip module
  3. Open each component file with gzip, write its lines to the output gzip, changing header lines as appropriate
  4. Close the output file.

Note: in the above, all files can remain gzipped, but you are effectively copying all the input faa files into a combined file. Is this necessary?
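
A minimal sketch of a fully gzipped version of these steps, again with hypothetical paths, naming, and header format (a sketch under those assumptions, not the actual implementation): no intermediate plain-text files are written, and each component file is streamed straight into the combined gzip output.

```python
# Sketch only: keep everything gzipped. Each component .faa.gz is read with
# the gzip module and copied line by line into one combined .gz file, with
# headers rewritten on the fly. Paths and naming are assumptions.
import glob
import gzip
import os


def combine_gzipped_faa(faa_gz_dir: str, combined_gz: str) -> None:
    # Step 2: open the combined gz file for writing with the gzip module.
    with gzip.open(combined_gz, "wt") as out:
        for faa_gz in glob.glob(os.path.join(faa_gz_dir, "*_protein.faa.gz")):
            # Step 1: derive the accession from the file path (works on the .gz name).
            accession = os.path.basename(faa_gz).replace("_protein.faa.gz", "")
            # Step 3: open each component file with gzip, rewriting headers as we copy.
            with gzip.open(faa_gz, "rt") as fh:
                for line in fh:
                    if line.startswith(">"):
                        orf_id = line[1:].split()[0]
                        out.write(f">{accession}_{orf_id}\n")
                    else:
                        out.write(line)
    # Step 4: the context managers close the output file and each input file.
```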
chasemc commented 1 year ago

Just to comment before I leave for the weekend... If the answer is yes, then if possible it's probably best to read the desired files (filename match) directly from the tar, edit the header/ID while reading, and write directly into the concatenated file. Note: I'm not familiar with this section of the code and I don't know the structure of the tar file, so this may or may not be a good suggestion.
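
A hedged sketch of that idea, assuming the archive is a tar whose members are per-genome `*.faa.gz` files (the archive layout is an assumption here, as noted in the comment above, and the naming and header format are also made up):

```python
# Sketch only: stream the desired members directly out of the tar archive,
# decompress them in memory, rewrite headers, and write straight into the
# concatenated gzip output. Archive layout, naming, and headers are assumptions.
import gzip
import os
import tarfile


def combine_from_tar(tar_path: str, combined_gz: str) -> None:
    with tarfile.open(tar_path) as tar, gzip.open(combined_gz, "wt") as out:
        for member in tar:
            if not member.name.endswith(".faa.gz"):
                continue
            accession = os.path.basename(member.name).replace("_protein.faa.gz", "")
            extracted = tar.extractfile(member)
            if extracted is None:  # skip non-regular members
                continue
            # Decompress the member in memory; nothing is extracted to disk.
            with gzip.open(extracted, "rt") as fh:
                for line in fh:
                    if line.startswith(">"):
                        orf_id = line[1:].split()[0]
                        out.write(f">{accession}_{orf_id}\n")
                    else:
                        out.write(line)
```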

chasemc commented 1 year ago

Hit submit before seeing @jason-c-kwan responded

bheimbu commented 10 months ago

Hi,

I'd like to use the GTDB database, but I'm not able to build it. Any news on when this will be fixed?

Cheers Bastian

evanroyrees commented 10 months ago

Looks like the tests are failing due to a recent issue with hdbscan and cython (https://github.com/scikit-learn-contrib/hdbscan/issues/600)

chasemc commented 10 months ago

Rolling back cython, as suggested by some comments in

> Looks like the tests are failing due to a recent issue with hdbscan and cython (scikit-learn-contrib/hdbscan#600)

didn't work.