-
This extension appears to have been developed a while ago; since then, fact checking has gained metadata support through the Schema.org [`ClaimReview` entity](https://schema.org/ClaimRev…
-
**Dataset Information:**
2021 will be the third year of this track. We already have 2019 (`clueweb12/b13/trec-misinfo-2019`), but should add 2020 (documents from CommonCrawl) and 2021 (TBD).
Thi…
-
Hello,
I'm using the "Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)" pre-trained vectors to replicate a study.
I ran ``demo.sh`` without any problems, and I want to repro…
-
Is this project deprecated? I see there are no commits since 2013, and there appears to be a new index scheme available since 2015: http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
…
-
Is there a Java or a Kotlin binding for this library?
-
This bug (https://github.com/internetarchive/surt/issues/20) reported against the Python `surt` module is also present in the Java implementation.
This URL
`http://example.com/script?type=a+b+%26…
-
`langstat2candidates.py`, particularly when used with the `-candidates` parameter, uses large amounts of RAM (32-64 GB for large language pairs). This is because it reads the entire can…
-
This loads a WARC file from the local file system:
`val r: RDD[ArchiveRecord] = RecordLoader.loadArchives(path, sparkConf)`
How do I load a WARC file from Amazon S3?
I found this gist. Is this the correct…
-
When I execute:
`python -m cc_net --dump 2019-13`
Here is the full log. Err:
```makefile
2023-05-10 08:56 INFO 259781:cc_net.jsonql - preparing [, , ]
2023-05-10 08:56 INFO 259781:cc_net.jsonql…
-
I encountered a UnicodeDecodeError while using a Korean tokenizer integrated into our data processing pipeline. This issue seems to occur specifically when processing certain types of input data with …