Regarding to Data Source and Data Structure

chq1155 commented 2 years ago

Hi, I am facing a data source problem. I would like to apply your amazing model to my own data, but I cannot build a data structure that matches the model's expected input.

Could you provide the correct data structure, or share the file "../data/domain_dict_full.pkl", so that I can figure this out?

By the way, there is a bug in fold_feat_gen.py on lines 42 and 48: the variables 'start' and 'end' should be strings so that they can be passed to the function 'replace'. Similarly, on lines 83, 87, 89, and 91 of the same file, values in the nested dictionary cannot be extracted by simply indexing as ss['seq'].
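To make the two reported problems concrete, here is a minimal sketch. It does not reproduce fold_feat_gen.py: the names `template` and `domain_dict`, the key `"d1abca_"`, and the nested layout (a domain ID mapping to a record that holds 'seq') are all hypothetical and assumed only from the description above.

```python
# Hypothetical illustration; this is not the code from fold_feat_gen.py.

# 1) str.replace only accepts strings, so integer bounds must be converted.
start, end = 5, 42                     # assumed to be integers, as reported
template = "residues START-END"        # hypothetical template string
try:
    template.replace("START", start)   # raises TypeError: argument must be str
except TypeError as err:
    print("replace failed:", err)
label = template.replace("START", str(start)).replace("END", str(end))
print(label)                           # residues 5-42

# 2) A nested dictionary needs one index per level.
domain_dict = {"d1abca_": {"seq": "MKTAYIAK"}}   # assumed layout
ss = domain_dict
try:
    ss["seq"]                          # raises KeyError: 'seq' is one level down
except KeyError as err:
    print("missing key:", err)
sequences = {name: record["seq"] for name, record in ss.items()}
print(sequences)
```

If the real dictionary is nested differently, the same idea applies: index the outer key first, then the inner field.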

Looking forward to your reply. Many thanks!!

raiyan3 commented 2 years ago

Hi,

Here is a link to the "domain_dict.pkl" file for the OD dataset. https://drive.google.com/file/d/12vRaK6JevY7Rt4MdB2DUKb1JPmh-bFuv/view?usp=sharing
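For anyone who wants to see the expected layout before building their own dictionary, a small sketch like the following can summarise the nesting of the downloaded file. The local path `domain_dict.pkl` and the helper `describe` are assumptions for illustration; nothing here is taken from the repository, and the actual contents of the file are not known from this thread.

```python
import pickle
from pathlib import Path

def describe(obj, depth=0, max_depth=6):
    """Return lines summarising nested dicts/lists without dumping the data."""
    pad = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{pad}dict with {len(obj)} keys"]
        if obj and depth < max_depth:
            key = next(iter(obj))                  # look at one example entry
            lines.append(f"{pad}  example key: {key!r}")
            lines.extend(describe(obj[key], depth + 2, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{pad}{type(obj).__name__} of length {len(obj)}"]
        if obj and depth < max_depth:
            lines.extend(describe(obj[0], depth + 1, max_depth))
        return lines
    return [f"{pad}{type(obj).__name__}"]

path = Path("domain_dict.pkl")   # assumed local path of the downloaded file
if path.exists():
    with path.open("rb") as handle:
        # Only unpickle files from a trusted source: loading can run arbitrary code.
        domain_dict = pickle.load(handle)
    print("\n".join(describe(domain_dict)))
```

The summary shows one example key per level, which is usually enough to mirror the structure with your own data.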

You may use the script "Generate_domain_dict.py" from the following link instead of "fold_feat_gen.py". It will generate the domain dictionaries. https://github.com/raiyan3/ECEN_766-600_CourseProject_TAMU/blob/12d6dc503ee33e38bbd3f36b7a489adf0e7c4ce5/Fold2Seq/Generate_domain_dict.py

Afterwards, run the "ss_dense_gen_parallel.py" from the following link to generate feature files. You can adjust the "num_cores" parameter as needed. https://github.com/raiyan3/ECEN_766-600_CourseProject_TAMU/blob/379932db7c65febfe6e93352d3fbdba606b536f6/Fold2Seq/ss_dense_gen_parallel.py
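As a rough idea of what a "num_cores" setting typically controls, the sketch below splits a list of domain IDs into chunks and hands each chunk to a worker process. This is an assumption about the general pattern, not the logic of "ss_dense_gen_parallel.py"; the functions `split_into_chunks`, `process_chunk`, and `run_parallel` are hypothetical.

```python
from multiprocessing import Pool

def split_into_chunks(items, num_cores):
    """Split items into at most num_cores contiguous, near-equal chunks."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    size, extra = divmod(len(items), num_cores)
    chunks, start = [], 0
    for i in range(num_cores):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:                 # skip empty chunks
            chunks.append(items[start:stop])
        start = stop
    return chunks

def process_chunk(domain_ids):
    """Placeholder for per-domain feature generation (hypothetical)."""
    return [f"{domain_id}: done" for domain_id in domain_ids]

def run_parallel(domain_ids, num_cores):
    """Process each chunk in its own worker and flatten the results."""
    chunks = split_into_chunks(domain_ids, num_cores)
    if not chunks:
        return []
    with Pool(processes=len(chunks)) as pool:
        results = pool.map(process_chunk, chunks)
    return [line for chunk in results for line in chunk]
```

Call `run_parallel(ids, num_cores=4)` from inside an `if __name__ == "__main__":` guard, which is needed on platforms that start worker processes by spawning. Setting the value above the number of physical cores rarely helps for CPU-bound work.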

Please let me know if you need any further clarification.

Shen-Lab commented 2 years ago

Thank you, @raiyan3! The two .py files are from your private repo and are not accessible. Let's update this public repo as needed and post their links here in the future. Please take your time. P.S. I also wonder whether domain_dict_full.pkl was just for the OD set or actually for the entire dataset.

raiyan3 commented 2 years ago

Thank you, @Shen-Lab, for the feedback. Regarding the question about the *.pkl domain dictionaries: there should be four separate *.pkl files corresponding to the train, val, id, and od datasets. I've only shared the smallest dataset dictionary (od) since @chq1155 mentioned they'd like to use their own dataset. If needed, I can also share links to the other dictionary files. The files are large, which makes sharing them directly cumbersome.

These are the latest links to the scripts mentioned previously.

https://github.com/Shen-Lab/Fold2Seq-icml2021/blob/c8bb81c500da1a8e452c9ff38a8d7ad2ffcbf4ae/data/ss_dense_gen_parallel_v2_0.py

https://github.com/Shen-Lab/Fold2Seq-icml2021/blob/c8bb81c500da1a8e452c9ff38a8d7ad2ffcbf4ae/data/Generate_domain_dict_v2_0.py

chq1155 commented 2 years ago

Solved! Thank you so much for the update!!

raiyan3 commented 2 years ago

Glad to hear it is working for you, @chq1155. I'm going to mark the issue as closed.

Please feel free to reach out with any questions at any time.

chq1155 commented 2 years ago

Dear Raiyan3,

Could you share a trained model for sequence generation?

Many thanks for your time!

Best, Hanqun

Shen-Lab commented 2 years ago

Thank you, @raiyan3. I would suggest not sharing the file. We do not have the trained model for the reported paper from the first author, and we are still trying to replicate it. That file was not verified. I will delete the comment.

chq1155 commented 2 years ago

Thank you for your reply. I only wanted to run inference with the trained model for reference. Maybe I need to train it myself!

Thanks again!

Shen-Lab / Fold2Seq-icml2021

Regarding to Data Source and Data Structure #1