dptech-corp / Uni-Fold

An open-source platform for developing protein models beyond AlphaFold.
https://doi.org/10.1101/2022.08.04.502811
Apache License 2.0

Errors in train_multi_label.json #108

Closed: dingquanyu closed this issue 1 year ago

dingquanyu commented 1 year ago

Hi,

I wonder how you generated train_multi_label.json, because there seem to be errors in this file. For example, 7a6o has two chains, A and B, in the PDB, but in this file they are labeled 7a6o_AAA and 7a6o_BBB. This and other similar mislabelings have caused errors for me. Could you upload the script that generated this JSON file? Thanks.

guolinke commented 1 year ago

I just checked the data and cannot find a problem. In train_multi_label.json, it has:

    "7a6o_AAA": [
        "7a6o_AAA",
        "7a6o_A"
    ],
    "7a6o_BBB": [
        "7a6o_B",
        "7a6o_BBB"
    ]

In pdb_labels, it has:

    7a6o_A.label.pkl.gz
    7a6o_AAA.label.pkl.gz
    7a6o_B.label.pkl.gz
    7a6o_BBB.label.pkl.gz

In pdb_features, it has:

    7a6o_AAA.feature.pkl.gz
    7a6o_BBB.feature.pkl.gz
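
A check like the one described here could be scripted roughly as follows. This is only a sketch: it assumes the JSON maps each feature name to a list of label names (as the excerpt suggests) and that files follow the `<name>.feature.pkl.gz` and `<name>.label.pkl.gz` pattern shown in the listings. The function names `load_multi_label` and `find_missing_files` are hypothetical and not part of Uni-Fold.

```python
import json
from pathlib import Path


def load_multi_label(json_path):
    """Load the multi-label mapping from a JSON file."""
    with open(json_path) as handle:
        return json.load(handle)


def find_missing_files(multi_label, feature_dir, label_dir):
    """Return (missing_features, missing_labels) for a multi-label mapping.

    Assumption: each key names a feature file and each value lists the
    label files that belong to it.
    """
    feature_dir, label_dir = Path(feature_dir), Path(label_dir)
    missing_features, missing_labels = [], []
    for feature_name, label_names in multi_label.items():
        # One feature file is expected per key.
        if not (feature_dir / f"{feature_name}.feature.pkl.gz").exists():
            missing_features.append(feature_name)
        # One label file is expected per listed label name.
        for label_name in label_names:
            if not (label_dir / f"{label_name}.label.pkl.gz").exists():
                missing_labels.append(label_name)
    return missing_features, missing_labels
```

Calling `find_missing_files(load_multi_label("train_multi_label.json"), "pdb_features", "pdb_labels")` would then return two lists, both empty if every referenced file is present.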

dingquanyu commented 1 year ago

I see. Sorry, since downloading the full dataset never worked for us, I prepared all the features myself and extracted the chain names from the mmCIF files, which gave me 7a6o_A and 7a6o_B instead of 7a6o_AAA and 7a6o_BBB. Now it makes sense. Thanks for checking it.
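
For anyone preparing features themselves and hitting the same mismatch, one way to reconcile self-extracted chain names with the keys in train_multi_label.json is to invert the mapping. The sketch below assumes the JSON structure shown in the excerpt earlier in this thread; `build_alias_index` is a hypothetical helper, not a Uni-Fold function. A possible cause of the difference, not confirmed in this thread, is that mmCIF files carry two chain identifiers, `label_asym_id` and `auth_asym_id`, which can differ for the same chain.

```python
def build_alias_index(multi_label):
    """Map every label name to the key(s) under which it is listed."""
    index = {}
    for key, label_names in multi_label.items():
        for label_name in label_names:
            index.setdefault(label_name, []).append(key)
    return index


# Entries copied from the excerpt quoted in this thread.
multi_label = {
    "7a6o_AAA": ["7a6o_AAA", "7a6o_A"],
    "7a6o_BBB": ["7a6o_B", "7a6o_BBB"],
}

alias_index = build_alias_index(multi_label)
print(alias_index["7a6o_A"])  # ['7a6o_AAA']
```

A name that is absent from the index (for example a chain that is not in the dataset at all) can then be reported explicitly instead of failing later during training.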

guolinke commented 1 year ago

@henrywotton you can now download the full dataset from ByteDance-hosted storage; check the README.