Open bardout opened 3 years ago
I'd like to know how you deal with this problem. The files I downloaded from the RSICD link you provided have purely numeric names, but the image names in the .json file are category + number. I am not sure whether I did something wrong; can you give me some advice? I need your dataset for my current work. Thank you very much for a quick reply.
Hello, I have moved to other projects and currently have no access to this data, so I am sorry that I cannot check the exact usage. However, I only modified the captions in the JSON provided with the original paper; I did not change the structure of the image set. I recommend checking whether a later update corrected the same typos/errors, on this GitHub and on Papers with Code: https://paperswithcode.com/dataset/rsicd. So the numbered files are in directories named after the category, and this can be handled in the dataloader, via a script on the file names, or inside the JSON.
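As a rough illustration of the "handle it in the dataloader" option, here is a minimal sketch. It assumes, without having checked the actual download, that JSON names look like `<category>_<number>.jpg` and that the files on disk are stored as `<root>/<category>/<number>.jpg`; the function name `resolve_image_path` and the example folder name are hypothetical.

```python
from pathlib import Path


def resolve_image_path(root, json_filename):
    """Map a JSON name such as 'airport_12.jpg' to root/airport/12.jpg.

    Assumed layout only: adjust to whatever the downloaded archive contains.
    """
    name = Path(json_filename)
    # Split on the LAST underscore so multi-word categories stay intact.
    category, sep, number = name.stem.rpartition("_")
    if not sep or not number.isdigit():
        # No 'category_number' shape: assume the file sits directly under root.
        return Path(root) / json_filename
    return Path(root) / category / f"{number}{name.suffix}"


if __name__ == "__main__":
    print(resolve_image_path("RSICD_images", "baseball_field_3.jpg"))
```

The same function could be run once over the JSON to rewrite the `filename` fields instead of being called from the dataloader.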
From: rongtongxueya @.> Sent: Wednesday, 12 July 2023 06:21 To: 120343/modified @.> Cc: BARDOUT Yves @.>; Author @.> Subject: Re: [120343/modified] Correction of typo and unknown words in RSICD (#2)
Images attached to the question above: [json] https://user-images.githubusercontent.com/97833507/252854305-97999ee3-60fa-41a9-b7e4-969cc52eeec0.png [RSICD] https://user-images.githubusercontent.com/97833507/252854321-c00f33cf-2326-4eea-b9bf-4fa33e2271f4.png
After noticing many unknown words in the original set, I switched to your set, which is much better. I applied a check to all tokens in the modified set. The English vocabulary is based on two NLTK corpora: words (the Unix reference word list for spelling) and wordnet (a synset thesaurus). I added 26 words in local_dict below, mostly compound words not properly hyphenated, and skipped the lookup for words within single quotes, which are considered proper nouns. These are not significant for most model training uses, but I chose to keep the intended captions.
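The check described above could look roughly like the sketch below. It is not the exact script used: the function names are hypothetical, the lookup is assumed to be case-insensitive, and `load_nltk_vocabulary` needs the NLTK `words` and `wordnet` corpora to be downloaded first.

```python
def load_nltk_vocabulary():
    """Build a lowercase vocabulary from the NLTK 'words' and 'wordnet' corpora."""
    # Requires nltk plus nltk.download("words") and nltk.download("wordnet").
    from nltk.corpus import wordnet, words

    vocabulary = {w.lower() for w in words.words()}
    vocabulary.update(name.lower() for name in wordnet.all_lemma_names())
    return vocabulary


def find_unknown_tokens(tokens, vocabulary, local_dict=frozenset()):
    """Return the tokens found neither in the vocabulary nor in local_dict."""
    unknown = []
    for token in tokens:
        # Single-quoted tokens are treated as proper nouns and skipped.
        if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
            continue
        word = token.lower()
        if word not in vocabulary and word not in local_dict:
            unknown.append(token)
    return unknown


if __name__ == "__main__":
    # Tiny hand-made vocabulary so the example runs without NLTK data.
    vocab = {"a", "building", "near", "the", "river"}
    print(find_unknown_tokens(["a", "buliding", "near", "'H'"], vocab))
```

Passing the vocabulary in as a plain set keeps the check independent of NLTK, so extra words can be added through `local_dict` without touching the corpora.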
There are many issues with the token lists as included in the JSON. Instead, I applied nltk.tokenize.word_tokenize, then removed string.punctuation and split hyphenated words (also splitting on '_'). This output is now written to the token lists. Arguably, the compound words should be corrected in the dataset's raw captions.
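A minimal sketch of this tokenization step follows. It assumes that "removing the punctuation" means dropping tokens made only of punctuation characters, which may differ from the original script; `clean_tokens` and `tokenize_caption` are hypothetical names, and the default tokenizer needs the NLTK tokenizer data to be downloaded.

```python
import re
import string


def clean_tokens(raw_tokens):
    """Drop punctuation-only tokens, then split the rest on '-' and '_'."""
    tokens = []
    for tok in raw_tokens:
        if all(ch in string.punctuation for ch in tok):
            continue  # e.g. ',', '.', '``'
        tokens.extend(part for part in re.split(r"[-_]", tok) if part)
    return tokens


def tokenize_caption(caption, tokenizer=None):
    """Tokenize a raw caption; defaults to nltk.tokenize.word_tokenize."""
    if tokenizer is None:
        # Lazy import: requires nltk and its tokenizer data (nltk.download).
        from nltk.tokenize import word_tokenize
        tokenizer = word_tokenize
    return clean_tokens(tokenizer(caption))


if __name__ == "__main__":
    # str.split stands in for word_tokenize so this runs without NLTK data.
    print(tokenize_caption("a T-shaped building near a tennis_court .", str.split))
```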
Of the remaining unknown words, 22 are corrected in the raw sentences before tokenization, using a spelling corrector or by looking up definitions (some botanical words in Latin: atrovirens, viridis, cataphractus, epinephelus; one, bhur, possibly Sanskrit; and some rare words or meanings, such as acclive and ruttling); see the dict below.
The markings in images are single-quoted proper nouns, which break during tokenization. I enclosed them manually in double quotes (escaped with a backslash) and listed them, although an alternative would be to capitalize them and recognize them as proper nouns.
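The quoting was done by hand, but a regex can approximate it. The sketch below is only an assumption about how one might automate it: `protect_marks` is a hypothetical name, the example caption is made up, and the pattern deliberately ignores apostrophes that sit inside a word.

```python
import json
import re

# A single-quoted span that is not glued to a word on either side,
# so possessives such as "plane's" are left alone.
_SINGLE_QUOTED = re.compile(r"(?<!\w)'([^']+)'(?!\w)")


def protect_marks(caption):
    """Replace 'X' with "X" so the mark survives tokenization as one unit."""
    return _SINGLE_QUOTED.sub(r'"\1"', caption)


if __name__ == "__main__":
    caption = protect_marks("a helipad marked 'H' is here")
    # json.dumps adds the backslashes when the caption is written to the file.
    print(json.dumps(caption))
```

When the caption is serialized, `json.dumps` escapes the double quotes, which matches the backslash-protected form described above.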
Finally, I reviewed the images for which a cryptic abbreviation was used ['poq', 'plq', 'pyq', 'pxq', 'p+q', 'ptq', 'p#q'], e.g. "building like poq and a building like plq and a building like p#q", in order to write a meaningful caption using the terms round-shaped, L-shaped, Y-shaped, X-shaped, cross-shaped, T-shaped and hash-shaped. Also, the noun cataphractus is removed before the adjective cataphracted, and 'square-189 jpg' is dropped from a description.
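The abbreviation-to-shape mapping can be written down as a lookup table. The sketch below only performs the literal substitution, pairing the two lists in the order given; the captions themselves were rewritten after reviewing the images, so the output of this function would still need manual rephrasing. The names `SHAPE_TERMS` and `expand_shapes` are hypothetical.

```python
import re

# Pairing assumed from the order of the two lists in the description.
SHAPE_TERMS = {
    "poq": "round-shaped",
    "plq": "L-shaped",
    "pyq": "Y-shaped",
    "pxq": "X-shaped",
    "p+q": "cross-shaped",
    "ptq": "T-shaped",
    "p#q": "hash-shaped",
}

# re.escape protects '+' and '#'; the lookarounds stop matches inside words.
_ABBREVIATION = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in SHAPE_TERMS) + r")(?!\w)"
)


def expand_shapes(caption):
    """Replace each cryptic abbreviation with its shape term."""
    return _ABBREVIATION.sub(lambda m: SHAPE_TERMS[m.group(0)], caption)


if __name__ == "__main__":
    print(expand_shapes("building like poq and a building like plq and a building like p#q"))
```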
The resulting file is uploaded as 'dataset_rsicd_v2.json'.
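For anyone who wants to regenerate the token lists themselves, a sketch of the write-back step is shown here. It assumes the JSON follows the common caption-dataset layout, with an `images` list whose entries hold `sentences`, each with `raw` and `tokens` fields; that layout is an assumption and should be checked against the actual file. `retokenize_dataset` is a hypothetical name, and the input file name in the example is a placeholder.

```python
import json


def retokenize_dataset(src_path, dst_path, tokenize):
    """Recompute every 'tokens' list from its 'raw' caption and save a copy."""
    with open(src_path, encoding="utf-8") as f:
        data = json.load(f)
    for image in data["images"]:              # assumed top-level key
        for sentence in image["sentences"]:   # assumed per-image key
            sentence["tokens"] = tokenize(sentence["raw"])
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


if __name__ == "__main__":
    # Placeholder input name; str.split stands in for the real tokenizer.
    retokenize_dataset("dataset_rsicd.json", "dataset_rsicd_v2.json", str.split)
```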