Why should the number of SAP representation protein sequence file lines and the number of Canonical compound SMILE file lines match?

CallMeDek commented 2 years ago

Hi,

I am trying to get results for my own data with your model.

(1) According to the file "DeepAffinity_inference.sh", it seems that the input protein sequence file and the compound file must have the same number of lines, as shown below. [Screenshot, 2022-09-22 10-46-42] Does this mean that the number of entities in the two files has to match, or literally that the number of lines in the two files has to match?

(2) I got two files for my own data after following your manual. Could you tell me whether their entries are structured correctly for model input?

CID_Smi_Feature:
protein_grouped_finalPresentation

Thank you, CallMeDek

Shen-Lab commented 2 years ago

The number of protein sequences and the number of compounds are required to be equal because we make predictions for given pairs of proteins and compounds: the k-th row of the protein file is paired with the k-th row of the compound file. If you are interested in cross prediction for all combinations of the given proteins and compounds, you can write a simple script that prepares the two files (with repeats) without having to change our scripts. Alternatively, you can modify the for loop in our script to use nested for loops instead.
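As a rough illustration of the "simple script" option, here is a minimal Python sketch that expands a protein file and a compound file into two row-aligned files covering every combination. It assumes each input file holds exactly one entry per line. The function names and the file names in the usage note are hypothetical and are not part of the DeepAffinity repository.

```python
from itertools import product
from pathlib import Path


def read_entries(path):
    """Return the non-empty lines of a file, assuming one entry per line."""
    return [line for line in Path(path).read_text().splitlines() if line.strip()]


def write_all_pairs(protein_in, compound_in, protein_out, compound_out):
    """Write two row-aligned files covering every protein-compound combination.

    Row k of protein_out is paired with row k of compound_out, which is the
    pairing convention described above. Returns the number of pairs written.
    """
    proteins = read_entries(protein_in)
    compounds = read_entries(compound_in)

    # Cartesian product: each protein is repeated once per compound.
    pairs = list(product(proteins, compounds))

    Path(protein_out).write_text("".join(p + "\n" for p, _ in pairs))
    Path(compound_out).write_text("".join(c + "\n" for _, c in pairs))
    return len(pairs)
```

For example, calling `write_all_pairs("proteins.txt", "compounds.txt", "proteins_all_pairs.txt", "compounds_all_pairs.txt")` on 2 proteins and 3 compounds would produce two files of 6 lines each, which could then be passed to the inference script in place of the original files.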

I am not sure what exactly you are asking in the second question. Could you please describe it in more detail? @AstroSign, could you please follow up if possible?

AstroSign commented 2 years ago

For the second question, your data looks good to me. Let me know if you encounter further issues.

