Wudicxy closed this issue 6 months ago
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz datasets/hiertext
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz datasets/hiertext
Missing train and validation datasets. I eagerly await your response.
What is the issue exactly? These two commands require the AWS CLI and will download the HierText train and validation data (although we only use the images). This is the download method provided by the authors in the original repository.
When I use aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz datasets/hiertext, I get ERR_TUNNEL_CONNECTION_FAILED (browser error: "The website with the URL https://open-images-dataset.s3.cl.amazonaws.com/ocr/train.tgz may be temporarily unavailable, or it may have been permanently moved to a new URL."). Could you please tell me whether this website is available?
This command should be run from a command line with the AWS CLI installed; you seem to be trying to access it via a browser. You can, however, download these files in the browser with these links:
https://open-images-dataset.s3.amazonaws.com/ocr/train.tgz
https://open-images-dataset.s3.amazonaws.com/ocr/validation.tgz
You have to place the uncompressed files under datasets/hiertext.
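For anyone who can use neither the AWS CLI nor a browser, the download and extraction steps described above could be scripted roughly as follows. This is only a sketch using the Python standard library: the helper names download_archive and extract_archive are made up for illustration, and it assumes that each archive unpacks into a folder named after its split (for example validation/), which has not been verified here.

```python
import tarfile
import urllib.request
from pathlib import Path

# Public HTTPS location of the archives, taken from the links in this thread.
BASE_URL = "https://open-images-dataset.s3.amazonaws.com/ocr"
SPLITS = ("train", "validation")


def download_archive(split, dest_dir):
    """Download <split>.tgz into dest_dir and return the local archive path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / f"{split}.tgz"
    urllib.request.urlretrieve(f"{BASE_URL}/{split}.tgz", str(archive_path))
    return archive_path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tgz archive into dest_dir and return the destination path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Newer Python versions support extraction filters; use the safe
        # "data" filter when available and fall back to the old behaviour.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return dest_dir
```

To fetch both splits, call extract_archive(download_archive(split, "datasets/hiertext"), "datasets/hiertext") for each split in SPLITS. The archives are large, so the download can take a long time.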
Great, thanks for your answer.
Hello, can I ask a question? I get this error: KeyError: "Dataset 'hiertext_validation' is not registered!". Could you tell me the solution? Thanks for your answer!
Hello again, yes, I believe this was my mistake! The name of the validation split was wrongly registered in the file adet/data/builtin.py; line 48 should be:
"hiertext_validation": ("hiertext/validation", "hiertext/validation.jsonl")
instead of the current:
"hiertext_val": ("hiertext/val", "hiertext/validation.jsonl")
I changed this line in the last commit, so if you pull the latest changes of the repo the problem should be fixed. Please tell me if that fixed your problem.
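To illustrate why the name mismatch produces that error, here is a minimal standalone sketch. It is not the code in adet/data/builtin.py: the dictionary HIERTEXT_SPLITS and the function resolve_split are hypothetical and only mimic a name-based lookup, using the corrected entry quoted above.

```python
from pathlib import Path

# Hypothetical stand-in for the split table; the single entry is the
# corrected line quoted in this thread.
HIERTEXT_SPLITS = {
    "hiertext_validation": ("hiertext/validation", "hiertext/validation.jsonl"),
}


def resolve_split(name, root="datasets"):
    """Return (image_dir, annotation_file) for a registered split name."""
    if name not in HIERTEXT_SPLITS:
        # Mirrors the message reported above for an unknown split name.
        raise KeyError(f"Dataset '{name}' is not registered!")
    image_dir, annotation_file = HIERTEXT_SPLITS[name]
    return Path(root) / image_dir, Path(root) / annotation_file
```

With this table, resolve_split("hiertext_validation") succeeds, while resolve_split("hiertext_val") raises the KeyError, which is the same kind of mismatch between the registered name and the requested name.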
Yes, line 48 is "hiertext_val": ("hiertext/validation", "hiertext/validation.jsonl"). The solution is right. Thank you very much for your answer.
Hello, which dataset are you referring to? The download links to the HierText-based training/validation and our proposed test set can be found in this section of the readme:
https://github.com/CVC-DAG/STEP?tab=readme-ov-file#datasets
The following link contains the JSON with the ground truth (GT) for training and validation:
While this one contains the test data: