-
At https://github.com/bigscience-workshop/bigscience we found 3835 records full of backslashes in OSCAR-en.
My suspicion is that OSCAR downloaded a single webpage that consisted of, say, 4B backs…
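
A minimal sketch of how such backslash-heavy records might be flagged, assuming each record is available as a plain string. The helper names and the 0.5 threshold are illustrative assumptions, not part of the BigScience or OSCAR tooling.

```python
def backslash_fraction(text: str) -> float:
    """Return the share of characters in `text` that are backslashes."""
    if not text:
        return 0.0
    return text.count("\\") / len(text)


def is_backslash_heavy(text: str, threshold: float = 0.5) -> bool:
    """Flag a record whose backslash share reaches `threshold` (an assumed cutoff)."""
    return backslash_fraction(text) >= threshold
```

The cutoff would need tuning against real records, since escaped text such as LaTeX or Windows paths legitimately contains some backslashes.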
-
In addition to CDXJ, the [ZipNum format](https://github.com/ikreymer/pywb/wiki/CDX-Index-Format#zipnum-sharded-cdx) uses a secondary index, which also includes a sortable url key but contains other da…
-
I see you have got multi-core working. Any tips?
I have it working, but at the expense of security.
Is there any way to run a multi-core spider on a single DB?
Have you played with any form of filter…
-
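## In a nutshell
A study that uses a pretrained language model to answer pronoun resolution problems. Given a sentence such as "I tried to put the computer in the bag, but 'it' was too big to fit," the task is to say whether "it" refers to the computer or the bag. The language model is used to replace "it" with each candidate answer and to observe how the probability of the whole sentence, or of the words after the replacement, changes.
The figure below shows the case where "it" in the sentence is replaced with the candidate answers trophy/suitcase (…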
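
The substitution idea can be sketched as follows. This is a hedged illustration covering only the whole-sentence variant: the function name `resolve_pronoun` and the `log_prob` callable are hypothetical, and `log_prob` stands in for whatever pretrained language model supplies sentence scores.

```python
from typing import Callable, Sequence


def resolve_pronoun(
    tokens: Sequence[str],
    pronoun_index: int,
    candidates: Sequence[str],
    log_prob: Callable[[Sequence[str]], float],
) -> str:
    """Return the candidate whose substitution yields the highest-scoring sentence.

    `log_prob` is an assumed, caller-supplied scorer that maps a token
    sequence to a log-probability under some language model.
    """
    def score(candidate: str) -> float:
        # Replace the pronoun with the candidate and score the full sentence.
        substituted = list(tokens)
        substituted[pronoun_index] = candidate
        return log_prob(substituted)

    return max(candidates, key=score)
```

The partial variant described above would instead score only the tokens that follow the substituted position.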
-
People using the 'External' tab are most likely trying to discover what type of content people usually link to.
I think they should be categorized, likely using some sort of website classification AP…
-
### Version
1
### DataCap Applicant
DATADAO
### Project ID
DATADAO-02
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / Hea…
-
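## Title: InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
## Link: https://arxiv.org/abs/2409.12568
## Summary:
Pre-training on large-scale, high-quality datasets is crucial for improving the reasoning capabilities of large language models (LLMs), especially in specialized domains such as mathematics. Although its importance is recognized…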
-
### Source Site
https://www.geograph.org.uk/
### Value Provided
Over 7 million photos of places in Ireland and the UK, aiming to cover every piece of land in a grid fashion "project aims to collect …
-
**Issue by [nanaya07](https://github.com/nanaya07)**
_Sat Jul 7 19:47:46 2018_
_Originally opened as https://github.com/codelucas/newspaper/issues/593_
----
Hi,
I am currently working on machine…
-
### Describe the bug
The recommended way to get the region in which an S3 bucket is located is the HeadBucket call. However, the region is returned in the HTTP headers of the response.
Follow up for #2…
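
For illustration, here is a minimal sketch of reading the region from the response headers. S3 returns it in the `x-amz-bucket-region` header; the helper name `region_from_headers` and the plain-mapping input are assumptions, not the API of any particular SDK.

```python
from typing import Mapping, Optional

# Header in which S3 reports the bucket's region on HeadBucket responses.
BUCKET_REGION_HEADER = "x-amz-bucket-region"


def region_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the bucket region from response headers, or None if absent.

    Header names are compared case-insensitively, as HTTP requires.
    """
    for name, value in headers.items():
        if name.lower() == BUCKET_REGION_HEADER:
            return value
    return None
```

A caller would pass in whatever header mapping its HTTP client exposes for the HeadBucket response.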