IBM / fold2seq

Code for Fold2Seq paper from ICML 2021
Apache License 2.0

How to generate domain data? #4

Open XZK9 opened 2 years ago

XZK9 commented 2 years ago

I can get a domain_seq.pkl containing the coords and seq info from the code, but the dataset also requires emb, padding and foldclass, which the code can't generate. How can I generate them? Do I need to write my own code, or is there some detail that I missed?

Krysta1 commented 2 years ago

I'm facing the same problem. Have you figured out how to generate the emb and the other information?

XZK9 commented 2 years ago

I wrote my own vocab, dataset and dataloader, just as for an NLP task. The emb is just a simple token index; you can generate the index in the dataloader, and the padding info as well. You can get the fold class label from the CATH dataset on the CATH website (it cost me a lot of time to figure this out; read the README on the CATH website). BTW, the style of the released code is terrible...
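A minimal sketch of what such a vocab and padding step could look like, in plain Python. The special tokens, the index order, the helper names and the mask convention (True marks padding) are all assumptions made for illustration; none of them is taken from the fold2seq repository.

```python
# Hypothetical helpers; names and token layout are assumptions, not fold2seq code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"


def build_vocab():
    """Map the special tokens and the 20 standard residues to integer indices."""
    tokens = [PAD, SOS, EOS, UNK] + list(AMINO_ACIDS)
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(seq, vocab):
    """Turn a residue string into token indices wrapped in <sos> ... <eos>."""
    body = [vocab.get(res, vocab[UNK]) for res in seq]
    return [vocab[SOS]] + body + [vocab[EOS]]


def pad_batch(encoded, pad_idx):
    """Right-pad each sequence to the longest one and build a padding mask."""
    max_len = max(len(ids) for ids in encoded)
    padded, mask = [], []
    for ids in encoded:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_idx] * n_pad)
        mask.append([False] * len(ids) + [True] * n_pad)  # True marks padding
    return padded, mask


vocab = build_vocab()
batch = [encode(s, vocab) for s in ["MKV", "ACDEFG"]]
padded, mask = pad_batch(batch, vocab[PAD])
```

In a PyTorch pipeline, a function like pad_batch would typically live in the dataloader's collate_fn, with the padded lists converted to tensors there.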

Krysta1 commented 2 years ago

I agree. I also spent some time preprocessing the PDB files. I think the /data/pdb_pre.py file is supposed to download the data needed, parse the information and save it into domain_list.txt. However, there are no instructions on how to use this file. Did you write a script to preprocess the data? I would appreciate it if you could share a copy.

XZK9 commented 2 years ago

Actually my scripts are very fragmented, but here are some of the important parts. You can download each PDB in pdb_list (train or valid) from CATH using the function below.

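import os
import time

import requests

def download_one_cath(label: str) -> float:
    # label is a CATH domain ID such as 1a6lA00
    print(label, end=' ')
    time.sleep(0.1)  # short pause between requests to go easy on the server
    st = time.time()
    url = 'http://www.cathdb.info/version/v4_3_0/api/rest/id/%s.pdb' % label
    try:
        pdb_data = requests.get(url, timeout=30)
        pdb_data.raise_for_status()  # treat HTTP errors (e.g. 404) as failures
        os.makedirs('./pdbs', exist_ok=True)  # make sure the output directory exists
        with open('./pdbs/' + label, 'wb') as fw:
            fw.write(pdb_data.content)
    except (requests.RequestException, OSError):
        return -1  # signal failure to the caller
    return time.time() - st  # elapsed time in seconds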

You can get the fold label from cath-domain-list.txt, downloaded from http://cathdb.info/wiki/doku/?id=data:index (check the README on that page to determine which column to use). The fold 3D features can be generated correctly using the provided scripts.
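A rough sketch of reading fold labels out of cath-domain-list.txt. It assumes the whitespace-separated layout described in the CATH README (column 1 is the domain ID, columns 2 to 4 are the class, architecture and topology numbers, and comment lines start with #) and that the fold corresponds to the topology level, written as C.A.T. Please verify both assumptions against the README for the release you download. The function name and the sample lines are made up for illustration.

```python
def parse_fold_labels(lines):
    """Return {domain_id: 'C.A.T'} from the lines of cath-domain-list.txt."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        domain_id, c, a, t = fields[:4]
        labels[domain_id] = f"{c}.{a}.{t}"
    return labels


# Made-up lines in the assumed layout, for illustration only.
sample = [
    "# comment line",
    "1abcA00     1    10     8    10     1     1     1     1     1    59 1.000",
    "2xyzB01     3    30    70   100     2     1     1     1     1   120 2.100",
]
fold_of = parse_fold_labels(sample)

# With the real file:
# with open("cath-domain-list.txt") as fh:
#     fold_of = parse_fold_labels(fh)
```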

raiyan3 commented 2 years ago

Hello, I followed your discussion and extracted the fold_class column from the cath-domain-list.txt file, then used that column directly as fold_index. Is this the correct method?

Also, I haven't yet figured out how to generate the "embed" and "padding" fields. Could you explain in more detail what you did to solve this?

XZK9 commented 2 years ago

Yes, I map the fold_index to fold_class (rather than using the fold_index itself). I wrote a new vocab and dataset, so I'm not sure whether you need to do the padding yourself. In any case, the protein sequences fed to the seq encoder should be padded for batched training, just like in any sequence task. The seq_encoder and decoder also need an embedding layer (nn.Embedding), as in any sequence task, but it is already defined in fold_classification_generator (lines 99 and 106), so you don't need to define it.
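For the fold label side, one possible way to get classifier targets is to assign every distinct fold label a contiguous integer index. The sketch below assumes fold labels are strings such as 3.30.70; the function name and the domain IDs are hypothetical, and this is only a guess at what such a mapping could look like, not the code used in this thread or in the repository.

```python
def build_fold_index(fold_of):
    """Assign each distinct fold label a class index in 0..N-1."""
    folds = sorted(set(fold_of.values()))  # sorted so the mapping is reproducible
    return {fold: idx for idx, fold in enumerate(folds)}


# Hypothetical domain IDs and fold labels, for illustration only.
fold_of = {"d1": "3.30.70", "d2": "1.10.8", "d3": "3.30.70"}
fold_to_idx = build_fold_index(fold_of)
targets = {dom: fold_to_idx[fold] for dom, fold in fold_of.items()}
```

If you go this way, build the mapping once from the training set and reuse it for validation and test data, so that the same fold always gets the same index.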