TREMA-UNH / trec-car-tools

Tools for working with the TREC CAR dataset.
http://trec-car.cs.unh.edu/
BSD 3-Clause "New" or "Revised" License

about data release v1.4 #6

Open dilekc opened 7 years ago

dilekc commented 7 years ago

The paragraph 83423c198b6099edba08f185f940042d5dba3b79 is annotated as relevant to more than one section ID, although the track web page states the following:

*.cbor.hierarchical.qrels: every paragraph is relevant only for its leaf most specific section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)

cat release-v1.4/fold1.train.cbor.hierarchical.qrels | grep 83423c198b6099edba08f185f940042d5dba3b79

yields the following output:

Joint%20University%20Programmes%20Admissions%20System/Difficulty 0 83423c198b6099edba08f185f940042d5dba3b79 1
Kawasaki%20Ki-100/Production 0 83423c198b6099edba08f185f940042d5dba3b79 1
Kawasaki%20Ki-61/Production 0 83423c198b6099edba08f185f940042d5dba3b79 1
Sports%20in%20San%20Antonio/NCAA%20college%20basketball 0 83423c198b6099edba08f185f940042d5dba3b79 1
Variational%20Bayesian%20methods/A%20more%20complex%20example 0 83423c198b6099edba08f185f940042d5dba3b79 1

bgamari commented 7 years ago

Has this been resolved?

laura-dietz commented 7 years ago

Here is my response (originally sent by email), posted here for everyone else as well:

Thanks so much for raising the flag. I think in this case it is not a bug, but due to some idiosyncrasies of the data:

All paragraphs that contain exactly the same (textual) content share the same ID (i.e., they are indistinguishable).

This happens quite a bit, especially for short paragraphs such as "We use the example below."
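To illustrate why identical text ends up with a single ID, here is a minimal sketch. It assumes the ID is a hash of the paragraph text; the 40-hex-digit IDs are consistent with SHA-1, but the exact scheme used to build the dataset is not stated in this thread, so `content_id` and the section names below are hypothetical.

```python
import hashlib


def content_id(text: str) -> str:
    """Hypothetical ID scheme: SHA-1 over the UTF-8 bytes of the text.

    This is an assumption for illustration only, not the confirmed way
    the TREC CAR paragraph IDs are computed.
    """
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Made-up sections; two of them contain exactly the same short paragraph.
paragraphs = {
    "Page A/Section 1": "We use the example below.",
    "Page B/Section 2": "We use the example below.",
    "Page B/Section 3": "A different paragraph.",
}

ids = {section: content_id(text) for section, text in paragraphs.items()}

for section, para_id in ids.items():
    print(section, para_id)
```

Under any content-derived scheme like this, the two identical paragraphs cannot be told apart, so a qrels file lists the same ID under each section in which the text occurs.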

You may want to look at the contents of "83423c198b6099edba08f185f940042d5dba3b79" to see if this is the case here as well.
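As a first step before inspecting the text, a short sketch like the following can list every paragraph ID that appears under more than one section. It assumes the four-column qrels layout shown in the output above (section ID, 0, paragraph ID, relevance); `sections_by_paragraph` is a hypothetical helper, and the sample lines are copied from the grep output in this issue.

```python
from collections import defaultdict


def sections_by_paragraph(qrels_lines):
    """Map each paragraph ID to the set of section IDs it is judged for."""
    sections = defaultdict(set)
    for line in qrels_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        section_id, _unused, para_id, _relevance = parts
        sections[para_id].add(section_id)
    return sections


# Lines taken from the grep output reported above.
sample = [
    "Joint%20University%20Programmes%20Admissions%20System/Difficulty 0 83423c198b6099edba08f185f940042d5dba3b79 1",
    "Kawasaki%20Ki-100/Production 0 83423c198b6099edba08f185f940042d5dba3b79 1",
    "Kawasaki%20Ki-61/Production 0 83423c198b6099edba08f185f940042d5dba3b79 1",
    "Sports%20in%20San%20Antonio/NCAA%20college%20basketball 0 83423c198b6099edba08f185f940042d5dba3b79 1",
    "Variational%20Bayesian%20methods/A%20more%20complex%20example 0 83423c198b6099edba08f185f940042d5dba3b79 1",
]

# Keep only the paragraph IDs that occur under more than one section.
shared = {
    para_id: secs
    for para_id, secs in sections_by_paragraph(sample).items()
    if len(secs) > 1
}

for para_id, secs in shared.items():
    print(para_id, len(secs), sorted(secs))
```

To run it on a full file, pass an open file handle for the qrels file instead of `sample`; the IDs it reports are the ones worth checking for short, repeated text.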

Please do let me know if you see something suspicious.