Open jolespin opened 2 years ago
Just following up on this. Any thoughts? If you don't have the bandwidth to implement please let me know and I'll try to find an alternative. Thanks.
The biggest limiting factor is that load_tax_info is loaded in eukcc.py and treehandler.py, so it doesn't have access to the initial arguments.
Right now, load_tax_info essentially hardcodes the $HOME directory when using NCBITaxa.
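One way around the hardcoded location, as a minimal sketch: resolve the sqlite path once and hand it to ete3 explicitly through the `dbfile` argument of `NCBITaxa`. The helper names `resolve_taxa_db` and `load_ncbi`, and the idea that the file is called `taxa.sqlite` inside the database directory, are assumptions for illustration and not EukCC's actual code.

```python
from pathlib import Path


def resolve_taxa_db(db_dir=None, filename="taxa.sqlite"):
    """Return the path to the ETE taxonomy sqlite file.

    db_dir and filename are hypothetical parameters. Without db_dir we
    fall back to ete3's default location under the home directory.
    """
    if db_dir:
        return str(Path(db_dir) / filename)
    return str(Path.home() / ".etetoolkit" / "taxa.sqlite")


def load_ncbi(db_dir=None):
    """Open NCBITaxa against an explicit dbfile instead of the default."""
    # Imported lazily so the path resolver itself does not depend on ete3.
    from ete3 import NCBITaxa

    return NCBITaxa(dbfile=resolve_taxa_db(db_dir))
```

With something like this, eukcc.py and treehandler.py would only need the database directory passed down to them, rather than relying on whatever $HOME happens to be.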
Another alternative to ete3 is just using taxonkit lineage.
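For reference, a rough sketch of what the taxonkit route could look like from Python. It assumes the `taxonkit` binary is on the PATH and that its output is one `taxid<TAB>lineage` row per input taxid, with lineage names separated by semicolons. The function names are hypothetical.

```python
import subprocess


def parse_lineage_output(text):
    """Parse `taxonkit lineage` output: one 'taxid<TAB>lineage' row per line."""
    lineages = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        taxid, _, lineage = line.partition("\t")
        # Drop empty names so a blank lineage column becomes an empty list.
        lineages[taxid] = [name for name in lineage.split(";") if name]
    return lineages


def taxonkit_lineage(taxids, data_dir=None):
    """Run `taxonkit lineage` on taxids; data_dir points at the NCBI taxdump."""
    cmd = ["taxonkit", "lineage"]
    if data_dir:
        cmd += ["--data-dir", data_dir]
    result = subprocess.run(
        cmd,
        input="\n".join(str(t) for t in taxids) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_lineage_output(result.stdout)
```

The nice part is that `--data-dir` makes the taxonomy location explicit, so nothing depends on the home directory.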
I've tried a few different manual edits for this but other than hardcoding my actual path, it's unusable.
Sorry I did not get back to you, I am currently quite busy as well but should be getting to this now.
Adding the ETE database is a very good idea in my opinion and on my todo list.
I will have a look at your suggestions and might come up with a solution to this issue.
As for updating MetaEuk, I am not sure about it: the training was done on MetaEuk 4, and updating it might lead to wrong results.
Providing custom protein files, you mean instead of providing the genome? That is already supported. Just pass the option -AA.
Sorry again for the delay.
Awesome, thanks for getting back to me. I took a couple of cracks at it (hence the fork) but I wasn't successful without hardcoding it. I hope some of the notes above help out if you decide to implement this.
Didn't see the -AA option but that seems like exactly what I was looking for with that note!
More than willing to test out some code for you if you decide to implement.
So, I have looked into the ETE3 database inclusion.
If you want to test it, download this new database http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.2.tar.gz
and install the version on the dev branch including the commit 36d0257
Thank you for the motivation. Let me know if this works.
Should the usage of the provided ete database be optional or mandatory? I am leaning towards mandatory as then I do not need to add more code, but am open for comments.
If it's already in eukcc2_db_ver_1.2.tar.gz, I would use that as the default, since it makes the program easier to run. Being able to specify a different one optionally would make it more flexible, though, for people who want to use the NCBI taxdb that their dataset is based on.
I've been taking a deep dive into eukaryotic metagenomics and EukCC seems like a great tool to have in the repertoire. From the documentation, it seems that EukCC version 2 is still in development so I thought I could make a log of suggested changes or edits:
Then the following code could be edited:
From base.py
from __main__.py:
EDIT: The commands above won't work because I falsely assumed EUKCC2_DB was an environment variable that was propagated through the different modules, but that's not the case. I was able to modify the following:
but was unable to figure out how to get the db path from treehandler.py (this function: tax_LCA).
Example error:
Is there a way to access the database path throughout all the scripts?
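One pattern that might work here, sketched under assumptions: have the entry point export the database path into the process environment right after argument parsing, and let every other module read it from there. The `--db` option and the function names below are placeholders for illustration; `EUKCC2_DB` is simply the variable name mentioned above, and this is not how EukCC is actually wired.

```python
import argparse
import os

DB_ENV_VAR = "EUKCC2_DB"


def parse_and_export(argv=None):
    """Parse arguments and publish the database path for other modules."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--db", default=None, help="path to the database directory")
    args = parser.parse_args(argv)
    if args.db:
        # Set once at startup; visible to all modules in this process.
        os.environ[DB_ENV_VAR] = os.path.abspath(args.db)
    return args


def get_db_path():
    """Look up the database path from anywhere, e.g. inside treehandler.py."""
    try:
        return os.environ[DB_ENV_VAR]
    except KeyError:
        raise RuntimeError(f"{DB_ENV_VAR} is not set; pass --db or export it") from None
```

A function like tax_LCA could then call `get_db_path()` without needing the parsed arguments threaded through every call.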
Apologies for bombarding your GitHub today. Finally got to a point in my pipeline that I've been working on for over a year and the EukCC bit is a critical stage.