RLGen / LakeBench

Is the pretrain stage of Deepjoin only designed for Opendata? #6

Open Mizryo opened 1 month ago

Mizryo commented 1 month ago

Dear all,

Thank you for the huge effort you have put into this project!

I have a question about the implementation of the pretrain stage for Deepjoin. In multi_process_csv.py, the function process_before_train calls process_task4, which looks for opendata tables regardless of which file (sato_opendata_new.csv or sato_webtable_new.csv) you pass as the --tain_csv_file argument of deepjoin_train.py.

If the functions are behaving as intended, could you tell me how to work around this? If not, I would really appreciate it if you could rewrite them to work for the webtable dataset.

Thank you in advance and best regards, Ryosuke

mutong184 commented 1 month ago

This is an important problem. It happens because the two datasets are structured differently, and it is not too difficult to fix. In multi_process_csv.py, replacing print("nofind this file", row[0]) at line 95 and print("nofind this file", row[1]) at line 109 with "pass" might solve the problem.
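To illustrate the idea behind that suggestion, here is a minimal sketch of a loop that silently skips rows whose table file is missing instead of printing a message for each one. It is not the code from multi_process_csv.py: the function name existing_rows, the table_dir parameter, and the assumption that row[0] holds a table file name are all hypothetical and used only for illustration.

```python
import csv
import os


def existing_rows(csv_path, table_dir):
    """Return the rows of csv_path whose first field names a file in table_dir.

    Hypothetical helper: assumes row[0] is a table file name relative to
    table_dir. Rows that point at missing files are skipped without output.
    """
    kept = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:
                continue  # ignore blank lines in the CSV
            if not os.path.isfile(os.path.join(table_dir, row[0])):
                # A print("nofind this file", row[0]) here would flood the log
                # when the CSV lists tables from a different dataset.
                continue
            kept.append(row)
    return kept
```

The sketch uses continue so that a missing file also skips the rest of the loop body. Whether a plain pass is enough in the real script depends on what follows the print at lines 95 and 109, which is not shown in this thread.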