AlexanderKroll / ESP

MIT License

Missing Data #15

Closed: EasternCaveMan closed this issue 6 months ago

EasternCaveMan commented 8 months ago

Dear Authors,

would you please provide the missing files:

/ESP/data/enzyme_data/Uniprot_df_with_ESM1b.pkl
/ESP/data/splits/df_train.pkl
/ESP/data/splits/df_test.pkl

Best, Vahid

EasternCaveMan commented 7 months ago

Dear Authors,

I hope this message finds you well. I would greatly appreciate it if you could provide the missing files. I need them to proceed with my master's thesis, which is closely related to your work.

Thank you so much.

Vahid Atabaigielmi

AlexanderKroll commented 7 months ago

Dear Vahid,

The files are not included in the repo because they should be easily reproducible from the uploaded files. For example, "df_train.pkl" and "df_test.pkl" are just slight modifications of "df_UID_MID_train_phylo.pkl", "df_UID_MID_train_exp.pkl" and "df_UID_MID_test_exp_phylo.pkl", all of which are uploaded (see https://github.com/AlexanderKroll/ESP/blob/main/notebooks_and_code/1_0%20-%20Creating%20enzyme-substrate%20database%20from%20GOA%20database.ipynb).

For "/ESP/data/enzyme_data/Uniprot_df_with_ESM1b.pkl" you would have to calculate the ESM-1b representations for all sequences using the fair-esm python package. If you have trouble doing this, I have uploaded this file for you here: https://drive.google.com/file/d/1BMMhgEQ0ILWoVJKEJLRVyLfxDmqYR9nW/view?usp=sharing
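For readers who want to recompute the representations rather than download the file, the following is a minimal sketch of how per-sequence ESM-1b vectors could be computed with fair-esm. It is not taken from the ESP repository: the helper names (`prepare`, `batches`, `embed`), the batch size, the truncation to 1022 residues (the longest input ESM-1b accepts) and the mean pooling over residue positions are all assumptions made for illustration, so the output may not match the uploaded pickle exactly.

```python
from typing import Dict, Iterator, List, Tuple

# ESM-1b accepts at most 1024 tokens, two of which are the BOS/EOS tokens.
MAX_RESIDUES = 1022

Record = Tuple[str, str]  # (sequence ID, amino acid sequence)


def prepare(records: List[Record]) -> List[Record]:
    """Truncate sequences that are too long for ESM-1b (assumed policy)."""
    return [(uid, seq[:MAX_RESIDUES]) for uid, seq in records]


def batches(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed(records: List[Record], batch_size: int = 4) -> Dict[str, "object"]:
    """Return {sequence ID: 1280-dim mean-pooled ESM-1b vector}."""
    # Imported lazily so the helpers above work without torch installed.
    import torch
    import esm

    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    model.eval()
    batch_converter = alphabet.get_batch_converter()

    embeddings = {}
    for batch in batches(prepare(records), batch_size):
        _, _, tokens = batch_converter(batch)
        with torch.no_grad():
            out = model(tokens, repr_layers=[33], return_contacts=False)
        reps = out["representations"][33]
        for i, (uid, seq) in enumerate(batch):
            # Position 0 is the BOS token; average over residues only.
            embeddings[uid] = reps[i, 1:len(seq) + 1].mean(0).numpy()
    return embeddings
```

Calling `embed([("P12345", "MKT...")])` (the ID here is a placeholder) downloads the model weights on first use; on CPU this is slow for a full UniProt-sized table, so running it on a GPU machine or in chunks is probably more practical.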

I hope this helps you to get/reproduce the missing files.

Good luck with your thesis! Best, Alex

EasternCaveMan commented 6 months ago

Dear Alex,

Thank you for your prompt response.

I've encountered an issue with the cluster folder. The number of sequences in the file all_sequence.fasta doesn't match the number of entries in Uniprot_df.pkl, which has left me somewhat confused. I attempted to map the sequences back to Uniprot_df.pkl, but ran into a problem: you've used the index as the ID for the sequences, which makes it impossible for me to map them back accurately.

I want to understand the dataset you used for clustering with CD-HIT. Specifically, I'm curious whether you used solely experimental data or a combination of experimental and phylogenetic data. If the clustering was based solely on experimental data, I would greatly appreciate access to the dataset you used for the clustering. Alternatively, if you could provide guidance on how to generate this dataset, I could replicate the process exactly as you did. My intention is to experiment with different splitting strategies and compare the results with yours. Thank you very much for your assistance.

Best regards, Vahid

AlexanderKroll commented 6 months ago

Dear Vahid,

I used all sequences (including the ones with phylogenetic evidence) to ensure that the phylogenetic data in the training set is not too similar to the test set. If you have problems running the cd-hit algorithm, you could send me a sequences.fasta file and I can calculate the output of the cd-hit algorithm for you. In this case you could also set the IDs in a different way (I agree that how I did it was unfortunate).
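One way to set the IDs differently is to write the FASTA file with a stable identifier (for example the UniProt ID) in each header and then read the cluster assignments back from the `.clstr` file that cd-hit produces. The sketch below assumes that layout; the function names, file names and example IDs are hypothetical and are not taken from the ESP repository.

```python
from collections import defaultdict
from typing import Dict, List


def write_fasta(id_to_seq: Dict[str, str], path: str) -> None:
    """Write one record per sequence, using the dict key as the FASTA ID."""
    with open(path, "w") as fh:
        for seq_id, seq in id_to_seq.items():
            # Whitespace in an ID would be cut off by most FASTA tools.
            if any(ch.isspace() for ch in seq_id):
                raise ValueError(f"ID contains whitespace: {seq_id!r}")
            fh.write(f">{seq_id}\n{seq}\n")


def parse_clstr(path: str) -> Dict[int, List[str]]:
    """Map cluster number -> list of sequence IDs from a cd-hit .clstr file."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    current = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">Cluster"):
                current = int(line.split()[1])
            else:
                # Member lines look like: "0\t350aa, >P12345... *"
                seq_id = line.split(">", 1)[1].split("...", 1)[0]
                clusters[current].append(seq_id)
    return dict(clusters)
```

With IDs written this way, the clusters can be joined back to the dataframe on its ID column instead of its positional index. Note that cd-hit shortens long headers in the `.clstr` file by default; to my knowledge the `-d 0` option keeps the header up to the first whitespace, but this is worth checking against the cd-hit documentation for the installed version.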

Best, Alex

EasternCaveMan commented 6 months ago

Dear Alex, thanks for the information, it helped a lot. Thank you for your offer, but I can run the cd-hit algorithm without any problem. Best, Vahid