chaoyi-wu / RadFM

The official code for "Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data".

the pretrain csv data for PMC-Inline #34

Open zihui-debug opened 4 months ago

zihui-debug commented 4 months ago

Hello, thanks for your valuable work! I noticed that data_csv.zip does not contain a paper_train.csv. Are the paper paths in this CSV file the same as the PMC-Inline text JSON files in your Hugging Face dataset at https://huggingface.co/datasets/chaoyi-wu/PMC-Inline/tree/main ? If not, how can I get paper_train.csv to run pretraining?

chaoyi-wu commented 4 months ago

Thanks for raising this. I missed this file in the earlier release and have now uploaded it to https://huggingface.co/datasets/chaoyi-wu/RadFM_data_csv/blob/main/paper_train.csv. Generally, we use all papers and only exclude those that were used to generate the test sets of other datasets, to avoid potential data leakage.
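The exclusion rule described above can be illustrated with a minimal sketch. This is not the code used to build paper_train.csv; the column names (`paper_id`, `json_path`), the helper `filter_training_papers`, and the toy rows are all hypothetical and exist only to show the idea of dropping papers that feed a test set elsewhere.

```python
import csv
import io


def filter_training_papers(all_rows, held_out_ids, id_column="paper_id"):
    """Keep every row whose identifier is NOT in the held-out set.

    `id_column` is an assumed column name, not one confirmed by the repo.
    """
    held_out = set(held_out_ids)
    return [row for row in all_rows if row[id_column] not in held_out]


# Toy in-memory CSV standing in for a full list of papers (made-up rows).
all_papers_csv = (
    "paper_id,json_path\n"
    "paper_a,paper_a.json\n"
    "paper_b,paper_b.json\n"
    "paper_c,paper_c.json\n"
)
all_rows = list(csv.DictReader(io.StringIO(all_papers_csv)))

# Hypothetical identifiers of papers that were used to build a test set.
held_out_ids = ["paper_b"]

train_rows = filter_training_papers(all_rows, held_out_ids)
print([row["paper_id"] for row in train_rows])
```

With real data, the held-out identifiers would come from whichever downstream test sets were derived from the same papers.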

zihui-debug commented 3 months ago

Thanks a lot! In addition, could you share the CSV files for the multi-label task? I couldn't find these files in the shared zip file: chestxray_new.csv, pcxr_train_new.csv, mammo_train_new.csv, and spinexr_train_new.csv.