Closed Sidduppal closed 1 year ago
Why do the files need to be decompressed?
Decompression is required to modify the fasta headers and then concatenate the sequences together for the final database.
Is that why the .sh files / grep were also modified?
Yes, that's an independent bug fix that I found during some testing. It adds an underscore after the orf ID, preventing any partial matches.
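The partial-match issue the underscore fix addresses can be illustrated in a few lines. The IDs below are made up for illustration and are not taken from the actual GTDB data:

```python
# Illustration of the partial-match bug: searching for an ORF ID without a
# trailing delimiter also matches longer IDs that share the same prefix.
ids = ["contig_1_orf_1", "contig_1_orf_10", "contig_1_orf_12"]

query = "contig_1_orf_1"

# Bare substring match (like an unanchored grep pattern): over-matches,
# because "contig_1_orf_1" is a prefix of "contig_1_orf_10" and "..._12".
loose = [i for i in ids if query in i]

# Appending the underscore delimiter anchors the match to the full ID.
strict = [i for i in ids if (query + "_") in (i + "_")]

print(loose)   # matches all three IDs
print(strict)  # matches only the exact ID
```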
Possible to use zgrep on still gzip'ed files instead?
Or pipe the output from zcat, or use the gzip module in Python
@chasemc
Possible to use zgrep on still gzip'ed files instead?
We are doing file modification on the files to change the FASTA headers. zgrep would only retrieve the headers; modifying them would be hard while the files are still zipped. It would also require using external subprocesses rather than internal Python modules.
@jason-c-kwan
Or pipe the output from zcat, or use the gzip module in Python
I am currently using the gzip module in Python for the file manipulation.
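For context, reading gzipped FASTA files with the gzip module and rewriting headers on the fly can be done without decompressing to disk. This is only a sketch of that idea; the function name, file paths, and header transform below are hypothetical, not Autometa's actual code:

```python
import gzip

def concat_with_new_headers(gz_paths, out_path, transform):
    """Stream records from gzipped FASTA files, rewrite each header with
    `transform`, and append everything to one concatenated output file.
    No intermediate decompressed files are written to disk."""
    with open(out_path, "w") as out:
        for path in gz_paths:
            # "rt" opens the gzip stream in text mode, line by line.
            with gzip.open(path, "rt") as fh:
                for line in fh:
                    if line.startswith(">"):
                        line = transform(line)
                    out.write(line)
```

A caller could then pass something like `lambda h: ">GB_" + h[1:]` as the transform to prepend a prefix to every accession.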
A possible approach would be to use zcat, modify the header, and then concatenate, but I'm not sure how efficient that would be compared to unzipping the files, which takes around 5-10 minutes.
My confusion and suggestion of zgrep was because I thought the edits were related (in the future, try to fix only a single thing in a PR, or at least separate out the commits).
My question is then the same as Jason's: is there a reason not to just read the files using the gzip module, rather than decompress, write, and then read back in?
Is the following code's single purpose to read some fasta files, edit the identifier and then concatenate into a single file? https://github.com/KwanLab/Autometa/blob/255066a2cdd9ed9371a2b68a344a269adee56554/autometa/taxonomy/gtdb.py#L57C2-L103
"Single purpose" meaning no other code relies on any of the extracted files.
Yeah, it seems like there are too many steps.
Just to comment before I leave for the weekend... If the answer is yes, it's probably best, if possible, to read the desired files (filename match) directly from the tar, edit the header/ID while reading, and write directly into the concatenated file. Note: I'm not familiar with this section of the code and I don't know the structure of the tar file, so this may or may not be a good suggestion.
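The suggestion above can be sketched with the standard tarfile and gzip modules. The member-name suffix, archive layout, and function names here are guesses for illustration, not the real GTDB tarball structure:

```python
import gzip
import tarfile

def tar_to_concat(tar_path, out_path, transform, suffix=".fna.gz"):
    """Read gzipped FASTA members directly from a tar archive, rewrite
    headers with `transform` while streaming, and write straight into one
    concatenated output file -- no extraction to disk."""
    with tarfile.open(tar_path, "r") as tar, open(out_path, "w") as out:
        for member in tar:
            # Filename match: only process regular files with the expected suffix.
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            raw = tar.extractfile(member)  # file-like object over the member
            with gzip.open(raw, "rt") as fh:
                for line in fh:
                    out.write(transform(line) if line.startswith(">") else line)
```

Iterating over the open `TarFile` streams members one at a time, so memory use stays low even for a large archive.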
Hit submit before seeing that @jason-c-kwan had responded.
Hi,
I'd like to use the gtdb database, but I'm not able to build it. Any news when this will be fixed?
Cheers Bastian
Looks like the tests are failing due to a recent issue with hdbscan and cython (https://github.com/scikit-learn-contrib/hdbscan/issues/600)
Rolling back cython, as suggested by some comments in scikit-learn-contrib/hdbscan#600, didn't work.
🐛 Fix bug with running GTDB taxonomic workflow and with setting up the GTDB database (issue328)