Closed SamuelCahyawijaya closed 5 months ago
Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.
To whom it may concern: regarding the github-actions reminder above, I apologise for the delay and any inconvenience caused. I am still working on this issue; please give me some more time. Regards.
@fhudi may we know if you are still working on this issue? It has already been one month since your last update.
Hi @akhdanfadh, thanks for the reminder.
Some files in the raw dataset turned out to be empty. I was in the process of downloading every combination of the 9 supported languages, to check and then ask the author for clarification, but somehow forgot halfway.
Hi @ssun32,
I tried to create a dataloader for your dataset, but it seems some of the files are empty.
Could you please help check BI-139 for all queries in Indonesian (id), i.e. id → *? 🙏
Also, regarding the license: it seems to be unknown, but on your CLIRMatrix site it seems to be cc-by-4.0, as written in the footnote.
So which one is correct? 🙏
@fhudi It turns out there is zero overlap of the id-xx examples with the examples in the other language directions, probably due to incomplete Wikidata entries for ID when I created the dataset a few years ago. I recommend throwing away the language directions with empty files. Thanks for spotting the issue!
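For anyone wanting to apply this recommendation mechanically, here is a minimal sketch of separating files that contain records from empty ones. It assumes the downloaded files sit under a single local directory and are either plain text or gzip-compressed; the function names `has_records` and `split_by_content` are hypothetical and not part of the dataset's tooling.

```python
import gzip
from pathlib import Path


def has_records(path):
    """Return True if the file holds at least one non-blank line."""
    # Assumption: files ending in .gz are gzip-compressed text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return any(line.strip() for line in handle)


def split_by_content(data_dir):
    """Split all files under data_dir into (kept, dropped) path lists."""
    kept, dropped = [], []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            (kept if has_records(path) else dropped).append(path)
    return kept, dropped
```

The `dropped` list then names the language directions to exclude from the dataloader.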
thanks @ssun32. What about the license?
@SamuelCahyawijaya, Need help 🙏
Shall we just remove support for the language ind?
@fhudi Removed ind in both the issue ticket and the datasheet.
@holylovenia @SamuelCahyawijaya
The dataset's task is TEXT_RETRIEVAL, so the seacrowd schema for this dataset is PAIRS, as noted in constants.py.
However, that schema seems incorrect, because each triplet contains a discrete numerical value for the relevance score, defined as follows:
Although PAIRS_SCORE seems more fitting, the dataset is formatted for the IR (Information Retrieval) task, where a single triplet holds multiple document IDs and their relevance scores. Note that PAIRS_MULTI is categorical and hence unfit.
I think we are not going to support the format of the IR task, right? Because if we do, it will be problematic to load the whole document texts instead of IDs in the dataloader.
One immediate solution I can think of, without many changes to the shared classes, is to represent the relevance score as categorical.
But WDYT?
Hi @fhudi, I agree with you. Let's do source-only for this dataloader. 👍
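To make the source-only idea concrete, here is a rough sketch of what one record could look like when it keeps document IDs and integer relevance scores rather than document texts. The field names `src_id`, `src_query`, and `tgt_results`, and the one-JSON-object-per-line layout, are assumptions for illustration and should be checked against the raw files; `parse_source_record` is a hypothetical helper, not the merged dataloader.

```python
import json


def parse_source_record(line):
    """Parse one JSON line into a source-style record (IDs and scores only)."""
    raw = json.loads(line)
    return {
        # Assumed field names; verify against the actual raw data.
        "src_id": str(raw["src_id"]),
        "src_query": raw["src_query"],
        # One query maps to many (document ID, relevance score) pairs,
        # which is why a single-pair schema does not fit.
        "tgt_results": [
            {"doc_id": str(doc_id), "score": int(score)}
            for doc_id, score in raw["tgt_results"]
        ],
    }
```

Keeping the scores as integers in the source schema avoids forcing them into a categorical label.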
Dataloader name: clir_matrix/clir_matrix.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?clir_matrix