debymf / ipa_probing


data folder missing #1

Open ivelin opened 1 year ago

ivelin commented 1 year ago

Congrats on the great paper! Really interesting to see how more general-purpose models apply to specialized tasks without feature engineering.

I am trying to reproduce the results and test against web3 app data. It looks like the data folder was not uploaded to the repo, even though it is referenced in the README:

The datasets with the splits used in the paper can be found inside the data folder.

However, there is no data folder, and consequently running the models fails with file-not-found errors:

python -m layout_ipa.flows.layout_lm.layout_lm_train_pair_classification
[2022-12-30 18:14:17] INFO - prefect.FlowRunner | Beginning Flow run for 'Running the Transformers for Pair Classification'
[2022-12-30 18:14:17] INFO - prefect.FlowRunner | Starting flow run.
[2022-12-30 18:14:17] INFO - prefect.TaskRunner | Task 'PrepareRicoScaPair': Starting task run...
2022-12-30 18:14:17.883 | INFO     | layout_ipa.tasks.datasets_parse.rico_sca.rico_sca_pair_prep:run:42 - Preprocessing Rico SCA dataset from data/rico_sca/rico_sca_train.json
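
For reference, here is a minimal sanity check for the layout the code appears to expect. Only the train path is confirmed by the log above; the dev and test filenames are assumptions based on the same naming pattern:

# Minimal sanity check before launching the flow (a sketch, not part of the repo).
from pathlib import Path

expected_splits = [
    Path("data/rico_sca/rico_sca_train.json"),  # confirmed by the log above
    Path("data/rico_sca/rico_sca_dev.json"),    # assumed split name
    Path("data/rico_sca/rico_sca_test.json"),   # assumed split name
]

missing = [str(p) for p in expected_splits if not p.is_file()]
if missing:
    raise FileNotFoundError(f"Missing dataset splits: {missing}")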

Could you please upload the data folder, or provide steps to reproduce the training data?

Thank you! 🙏🏼

cc: @debymf

debymf commented 1 year ago

Hi @ivelin, thanks for your interest! Unfortunately, I later realized that the dataset is copyrighted, and I can't make it available in my repository :(. You will have to follow the instructions here to generate it: https://github.com/google-research/google-research/blob/master/seq2act/data_generation/README.md

I will update the instructions later to reflect this.

ivelin commented 1 year ago

Thank you for getting back to me, and Happy New Year, @debymf. It seems the original seq2act instructions are buggy: I see folks reporting problems trying to generate the dataset, and I am stumbling into several issues myself.

While researching options, I came across the Donut model and the UIBert dataset, which is a newer iteration of the RicoSCA dataset.

Have you thought about running your experiment with some of the newer multimodal, OCR-free VDU (visual document understanding) models like Donut, to see whether RefExp performs well without preliminary OCR annotation of the document? A related point to test in an OCR-free setting would be whether the model can comprehend not only content and the relationships between UI components with text labels, but also components without any text, such as image, icon, and avatar buttons.
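
To make the idea concrete, here is a rough sketch of what such a query could look like with the Hugging Face transformers Donut classes. The checkpoint is the generic pretrained Donut base model, and the <s_refexp> task prompt is purely hypothetical — it would only produce meaningful output after fine-tuning on RefExp pairs (screenshot + referring expression -> bounding box):

# Rough sketch of an OCR-free RefExp query with Donut (transformers).
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

image = Image.open("ui_screenshot.png").convert("RGB")  # any app screenshot
pixel_values = processor(image, return_tensors="pt").pixel_values

# Hypothetical task prompt: ask for the component matching the expression,
# including icon/avatar buttons that carry no text label.
task_prompt = "<s_refexp><s_prompt>click the search icon</s_prompt>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
)
# After fine-tuning, the decoded sequence would carry the target bounding box.
print(processor.token2json(processor.batch_decode(outputs)[0]))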

That is the experiment I am currently working on. If you have any interest in it, I would love to collaborate.