embeddings-benchmark / arena

Code for the MTEB Arena
https://hf.co/spaces/mteb/arena

Final launch TODOs #12

Closed Muennighoff closed 1 month ago

Muennighoff commented 2 months ago

You can play with the space & retrieval models here: https://b3246e5ab28482f60e.gradio.live - Not all models & indices are cached yet so some first runs may be slow but once cached it should be blazing fast. Some TODOs below - would be great if we can get them done as fast as possible! 🚀

We're almost there! ❤️

2024-07-10 20:32:37 | ERROR | stderr |     docs = index.search(query_embeds=query_embed.tolist(), topk=topk)
2024-07-10 20:32:37 | ERROR | stderr |   File "/data/niklas/arena/retrieval/gcp_index.py", line 206, in search
2024-07-10 20:32:37 | ERROR | stderr |     response = self.endpoint.find_neighbors(
2024-07-10 20:32:37 | ERROR | stderr |   File "/env/lib/conda/gritkto/lib/python3.10/site-packages/google/cloud/aiplatform/matching_engine/matching_engine_index_endpoint.py", line 1551, in find_neighbors
2024-07-10 20:32:37 | ERROR | stderr |     response = self._public_match_client.find_neighbors(find_neighbors_request)
2024-07-10 20:32:37 | ERROR | stderr |   File "/env/lib/conda/gritkto/lib/python3.10/site-packages/google/cloud/aiplatform_v1beta1/services/match_service/client.py", line 758, in find_neighbors
2024-07-10 20:32:37 | ERROR | stderr |     response = rpc(
2024-07-10 20:32:37 | ERROR | stderr |   File "/env/lib/conda/gritkto/lib/python3.10/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
2024-07-10 20:32:37 | ERROR | stderr |     return wrapped_func(*args, **kwargs)
2024-07-10 20:32:37 | ERROR | stderr |   File "/env/lib/conda/gritkto/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
2024-07-10 20:32:37 | ERROR | stderr |     raise exceptions.from_grpc_error(exc) from exc
2024-07-10 20:32:37 | ERROR | stderr | google.api_core.exceptions.ServiceUnavailable: 503 recvmsg:Connection reset by peer
orionw commented 2 months ago

Thanks @Muennighoff! This is exciting!

Fix some issues with the results: (1) For individual there is both model & model_name (2) side_by_side seems to only include one model (3) Should add corpus for retrieval - maybe @orionw ?

I think (2) is some of the older results that accidentally got uploaded -- nothing currently gets sent to "side_by_side". I'll nuke the results though and refresh with the new format.

To confirm this is what we want:

I'll remove model from the individual to keep it aligned with the others and add corpus. Do we want to add a default corpus to Clustering/STS also in case we eventually add it to those @Muennighoff?

Check if worth adding StackOverflow (a bit expensive due to its size)

We could also do a subset if we want, although two is a fine start.

spins up a new instance when requesting the GPUs I think

The GPU stuff seems concerning. Is there much documentation on how these Spaces with ZeroGPU work behind the scenes? Does it auto-provision AWS instances?


Also worth including in this checklist: the Git LFS issues (we can't store the results as pickles) if we want to mirror from GitHub to the Hugging Face Space. I paid for this month of Git LFS bandwidth as we sort it out, but I don't want to pay indefinitely (although it is cheap, $5 a month).

Muennighoff commented 2 months ago

Battle/Side by Side log to both individual and to battle

I think Battle & Side by Side are separate no? E.g. clustering_battle & clustering_side_by_side? We can also merge them if you think it is better.

We will probably have to nuke the results once more right before the launch as there will still be a bit of testing we'll do I assume.

We could also do a subset if we want, although two is a fine start.

Will look into it & let you know 👍

The GPU stuff seems concerning. Is there much documentation on how these Spaces with ZeroGPU work behind the scenes? Does it auto-provision AWS instances?

Some details are here: https://huggingface.co/spaces/zero-gpu-explorers/README & it is attached to our space so you can play with it if you want to.

I paid for this month of Git LFS bandwidth as we sort it out but I don't want to pay indefinitely

Oh I didn't notice this, sorry! Please feel free to merge the Git LFS PR & send me your bank account via Slack so I can wire you what you paid!

orionw commented 2 months ago

I think Battle & Side by Side are separate no? E.g. clustering_battle & clustering_side_by_side? We can also merge them if you think it is better.

Yeah this was confusing for me for a bit too.

The vote_last_response function is called for both battle and side_by_side when you click the button. Since it calls the same function, I sent them both to battle. I think it's excluded from the leaderboard calculation if the model names are non-anonymous (and thus side_by_side).

Would we rather send them to separate results sections based on if the name is anonymous? I don't have a preference either way and now is a good time to switch if we want.

Oh I didn't notice this, sorry! Please feel free to merge the Git LFS PR & send me your bank account via Slack so I can wire you what you paid!

Nah it's really minor and it was me who forgot about the Git-LFS bandwidth. Really it's so inexpensive I was considering if we should just get a $60 sponsoring donation on Github for a year from some company and not worry about it... Could be an alternative if we want to keep Git LFS as it really is a trivial amount to be sponsored. Something to keep in mind.

Muennighoff commented 2 months ago

Yeah this was confusing for me for a bit too. The vote_last_response function is called for both battle and side_by_side when you click the button. Since it calls the same function, I sent them both to battle. I think it's excluded from the leaderboard calculation if the model names are non-anonymous (and thus side_by_side). Would we rather send them to separate results sections based on if the name is anonymous? I don't have a preference either way and now is a good time to switch if we want.

I think reusing the vote_last_response func is fine & we can separate them after? Is it a mistake with https://huggingface.co/datasets/mteb/arena-results then that it currently has side_by_side and battle? I do think it's nice to have them separated there into side_by_side & battle. I think that we should start with only anonymous results counting and later consider adding non-anon depending on the traction.
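A minimal sketch of that separation (the function name and routing here are hypothetical, not the repo's actual code):

def results_section(task: str, anonymous: bool) -> str:
    # Anonymous votes count for the leaderboard and go to "<task>_battle";
    # named (non-anonymous) votes are stored under "<task>_side_by_side".
    return f"{task}_battle" if anonymous else f"{task}_side_by_side"

assert results_section("clustering", anonymous=True) == "clustering_battle"
assert results_section("retrieval", anonymous=False) == "retrieval_side_by_side"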

$60 sponsoring donation on Github for a year from some company and not worry about it

Yes, given we spent $5K+ already on the indices this should be easily doable, lmk!

Muennighoff commented 2 months ago

I've made a new space that utilizes local GPUs and is blazing fast: https://huggingface.co/spaces/mteb/arena-tmp . Unfortunately, I think this will break the GitHub syncing from this repository. Instead, I think I will have to just manually pull from the repo frequently. Are you okay with that @orionw ? Curious about your thoughts, too. If fine with you, maybe you can stop the syncing & rename the space. Or maybe we can sync it with my local clone somehow? Maybe just a crontab or something so I regularly push the results & sync the space 🤔

Also if someone has any cool idea for what we could do with the Zero GPU allocation, feel free to propose something / make use of them!

orionw commented 2 months ago

🔥🔥🔥 Thanks @Muennighoff, works great!

Are you okay with that @orionw ? Curious about your thoughts, too. If fine with you, maybe you can stop the syncing & rename the space.

Of course! Bummer the ZeroGPUs didn't work :/ Feel free to rename the other space and move this one in. We can stop the syncing by removing the .github actions file.

orionw commented 2 months ago

Add arXiv random samples (see https://github.com/embeddings-benchmark/arena/issues/5) - maybe @orionw ?

Did you already implement this? Looks like I see it, thanks!

I've updated the corpus without the newlines, you can see them here (although I guess you can't see without loading manually haha). Here's a screenshot:

[screenshot attached]

do think it's nice to have them separated there into side_by_side & battle.

I've separated them and made a PR in #13. The data issue will take some coordination, perhaps we do that over messaging (some notes in the PR).

Muennighoff commented 2 months ago

🔥🔥🔥 Thanks @Muennighoff, works great!

Are you okay with that @orionw ? Curious about your thoughts, too. If fine with you, maybe you can stop the syncing & rename the space.

Of course! Bummer the ZeroGPUs didn't work :/ Feel free to rename the other space and move this one in. We can stop the syncing by removing the .github actions file.

I removed the .github folder for now, feel free to revert if this was wrong. Hopefully, we can bring it back one day. We can use its code, though, to set up a bash script that updates the LB, pushes changes, and runs via crontab every 24h or so.

Did you already implement this? Looks like I see it, thanks!

Not yet! They are just the samples from Wikipedia atm 😅

I've updated the corpus without the newlines, you can see them here (although I guess you can't see without loading manually haha).

Looks great! Are you confident it is better & good as is, i.e. that I can go ahead and recreate all arXiv indices?

isaac-chung commented 2 months ago

Sorry I'm a bit late to the party!

a) occasional Connection reset by peer error

this mostly happens when two queries come in at the same time / closely after one another

[initial thought] I wonder if the queries are targeting different indices? I could imagine a scenario where the same endpoint is busy when it needs to load different indices. [follow up] Docs claim that "A maximum of 20 indexes can be deployed on a single endpoint." This leads me to believe that the cause lies elsewhere.

but it could also be that I set max replica node count to 1 which means it cannot autoscale

I wonder if these requests are the first requests to their respective indices in a while (maybe 10 min, if the index endpoints are behind a load balancer)? The desired index may not be available if it's a cold-start scenario. The default minimum replica count is 2, I think, but that'll double the serving cost.
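If cold starts are the culprit, keeping more replicas warm at deploy time should help; a rough sketch with the google-cloud-aiplatform SDK (all resource names below are placeholders):

from google.cloud import aiplatform

aiplatform.init(project="our-gcp-project", location="us-east1")  # placeholder project

index = aiplatform.MatchingEngineIndex(index_name="projects/123/locations/us-east1/indexes/456")  # placeholder
endpoint = aiplatform.MatchingEngineIndexEndpoint(index_endpoint_name="projects/123/locations/us-east1/indexEndpoints/789")  # placeholder

# Keep two replicas warm so a first query doesn't hit an index that is still loading.
endpoint.deploy_index(
    index=index,
    deployed_index_id="arena_wikipedia_minilm",  # placeholder id
    min_replica_count=2,
    max_replica_count=2,
)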

b) Index Cost

Some example cost calculations from the GCP pricing page:

  1. arxiv + sentence-transformers/all-MiniLM-L6-v2:
    • building cost: 2,511,805 rows × 384 dims × 4 bytes × (1 GiB / 1,073,741,824 bytes) × $3/GiB = $10.78
    • serving cost per month: us-east1, e2-standard-16 at $0.75/hr × 730 hrs = $547.50
  2. arxiv + intfloat/e5-mistral-7b-instruct:
    • building cost: 2,511,805 rows × 4096 dims × 4 bytes × (1 GiB / 1,073,741,824 bytes) × $3/GiB = $114.99
    • serving cost per month: $547.50 (same as above)

The docs say the serving cost is the "Per node hour pricing for each VM used to host a deployed index". I wonder if that's per endpoint or per index if we're using the same endpoint for all indices.
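For reference, a tiny script reproducing the building-cost arithmetic above (the function name is ours; the $3/GiB price is the one quoted):

def building_cost_usd(rows: int, dims: int, bytes_per_dim: int = 4, usd_per_gib: float = 3.0) -> float:
    # Building cost = raw embedding size in GiB times the per-GiB price.
    gib = rows * dims * bytes_per_dim / 1_073_741_824
    return gib * usd_per_gib

print(round(building_cost_usd(2_511_805, 384), 2))   # 10.78
print(round(building_cost_usd(2_511_805, 4096), 2))  # 114.98, i.e. the ~$114.99 above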

isaac-chung commented 2 months ago

Run BM25S on MTEB Retrieval

I can take a look this weekend at adding bm25s (https://github.com/embeddings-benchmark/arena/issues/6 / https://github.com/embeddings-benchmark/mteb/issues/990)
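For anyone following along, basic bm25s usage looks roughly like this (per its README; the toy corpus is ours):

import bm25s

corpus = ["a cat is a feline", "a dog is a canine", "birds can fly"]

retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))  # tokenize, then build the index

# Retrieve the top-2 documents (indices and scores) for a query.
results, scores = retriever.retrieve(bm25s.tokenize("which animal can fly?"), k=2)
print(results, scores)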

Muennighoff commented 2 months ago

I wonder if these requests are the first requests to their respective indices in a while (maybe 10 min, if the index endpoints are behind a load balancer)? The desired index may not be available if it's a cold-start scenario. The default minimum replica count is 2, I think, but that'll double the serving cost.

Yes, I think they are usually the first ones in a while. Indeed, maybe we should try increasing the minimum replica count and see if it goes away.

Cost

That looks about right. We've spent around 6K USD atm.

Run BM25S on MTEB Retrieval

I can take a look this weekend at adding bm25s (#6 / embeddings-benchmark/mteb#990)

Amazing! There are also results in Table 3 of the paper (https://arxiv.org/pdf/2407.03618); I think we are using the Lucene variant with k1=1.5, b=0.75, so you can just pick the results from there and maybe put them in a result file like e.g. here, add them to the results repo so they will show up on the LB, and then we also add the average score here.

orionw commented 2 months ago

Looks great! Are you confident it is better & good as is, i.e. that I can go ahead and recreate all arXiv indices?

Yup, did some validation and it all looks great! Make sure it's the new one here, though, which has the whitespace normalized in abstracts and titles: https://huggingface.co/datasets/orionweller/arxiv_7_2_24

Muennighoff commented 2 months ago

Nice, will recreate those indices.

Also fixed the broken performance of SFR & Grit via the changes here; screenshot attached. The problem was that no query instruction was being used, which really matters a lot for them.

[screenshots attached]
Muennighoff commented 2 months ago

For the random sample, I think we want to make it such that it is also random whether it is arXiv or Wikipedia? I.e. you press random sample and then a random corpus is selected & an accompanying random sample, similar to how for Clustering the random sample button also changes the number of clusters. Currently, you select a corpus and then it will provide a random sample (which are all Wikipedia-targeted NQ samples atm, i.e. still missing arXiv).
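A minimal sketch of that behavior (the pools and names here are hypothetical):

import random

SAMPLE_POOLS = {  # hypothetical pools of prewritten queries per corpus
    "wikipedia": ["who was the first person in space?"],
    "arxiv": ["papers on sparse attention for long documents"],
}

def random_corpus_and_sample() -> tuple[str, str]:
    # Pick a random corpus first, then a random query from its pool.
    corpus = random.choice(list(SAMPLE_POOLS))
    return corpus, random.choice(SAMPLE_POOLS[corpus])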

Muennighoff commented 1 month ago

@orionw can you create a subsampled version of Stack Exchange that is similar in size to Wikipedia & arXiv? (they are about the same). Maybe we can be smart about the subsampling & remove exchanges that are likely less interesting to our expected users 🤔

I think having Stack Exchange would be great - it is a really nice & different source. After a few weeks we can think about upsizing it / adding other corpora depending on user feedback & monitoring corpus usage.

Muennighoff commented 1 month ago

@orionw it seems like the StackExchange corpus often contains multiple answers? Maybe only keeping the top answer would make it much smaller and would likely suffice. I think most voters won't have the patience to read through all answers anyway.

>>> print(ds['train'][3]['text'])
Q: How can I make focus follow the mouse cursor? I will often click on a button expecting it to be clicked but instead all that happens is the application it is in becomes active, and I have to click again to actually click the button.  It would be nice if this second click wasn't needed, which leads me to my question:

How can I make it so that when I move the mouse cursor over an inactive window, it becomes active?

A: This is freely possible for the Terminal and X11 :

defaults write com.apple.Terminal FocusFollowsMouse -string YES
defaults write com.apple.x11 wm_ffm true

Or, OS-wise, with a utility that seems to fit your needs, called MondoMouse.

A: I originally wanted to do this with my first Mac a couple years ago as well, since that's how my Linux and Windows environments behave.  But I think the driving force preventing this from becoming a reality is in how OS X handles application menus.
What if you want to go to the menu at the top of the screen for an application you're using, but in the process briefly hover over another application?  That would become infuriating quickly.
In short, I don't think its doable for that and potentially other reasons.

A: Best little utility I stumbled upon is Zooom/2. Strange name, hence hard to find. You can choose delay (Rather cumbersome, OS X and global menu is not designed to allow that). I set it to focus window under cursor instantly when Option key is pressed. Great value, no dock or tray icons, it just works.

A: Amethyst (https://github.com/ianyh/Amethyst) is excellent. 
Follow the README.md instructions to download, and then enable "Focus Follows Mouse" in the Misc. section of the Settings view.

>>> print(ds['train'][2]['text'])
Q: Repair Disk - Start up disk options I had a power failure and upon rebooting noticed that the OS drive needed to be repaired (Disk Utilities). I am running Snow Leopard and don't have the CD to start up from in order to perform the fix.
Are there any other options for running the repair utils on the startup disk?

A: One option would be to clone your startup drive to an external disk using something like SuperDuper! or Carbon Copy Cloner. Then you can use System Preferences->Startup Disk to select that external drive as the boot drive. 
Once you've rebooted and are running the system off the external drive you can use Disk Utility to run the repair. After you're done, re-select the internal drive as the Startup Disk and reboot.

A: One option that doesn't require any external drives or disks:
Disk Utility's repair disk is largely* a thin wrapper over the unix fsck (stands for "File System Check") utility.  You can run it by:

*

*Booting into "Single User Mode" by rebooting and holding command-S during startup.

*A command-line input will appear; enter /sbin/fsck -fy

*Wait for it to complete.  If you see **** FILE SYSTEM WAS MODIFIED ***** then run it again, since sometimes fixing the first errors will uncover more.

*Repeat until it says that the disk appears to be ok.

*Enter Reboot to boot normally.

*I can't find any indication that Disk Utility's "Repair Disk" function does anything that fsck doesn't.  Nonetheless, Apple recommends that you use Disk Utility instead when that is an option.
Muennighoff commented 1 month ago

Also it seems like there are no titles for the Stack Exchange corpus --- Shouldn't we be able to get the question titles?

Muennighoff commented 1 month ago

Also if we could add the Exchange that each sample is from that'd be great I think. Could put it in the title or sth; would likely improve results a lot.

orionw commented 1 month ago

For sure, these are good suggestions @Muennighoff! Probably the weirdness of the data comes from how RedPajamas/Olmo did the formatting. There’s probably an original sample that they got it from if we want to do more modifications.

Unfortunately I won’t have cycles for this until Friday, currently at the SIGIR conference. Sorry! If that’s too late feel free to have someone else take a look at it.

KennethEnevoldsen commented 1 month ago

@Muennighoff added the google API here

Muennighoff commented 1 month ago

Unfortunately I won’t have cycles for this until Friday, currently at the SIGIR conference. Sorry! If that’s too late feel free to have someone else take a look at it.

I think that's fine! We can delay launch to next week. Also, some of the Wikipedia passages are really long and seem to go beyond sections, e.g. I got the below corresponding to this Wikipedia page. I think this is problematic: people want to vote quickly, not spend a few minutes reading, so the texts need to be short. Don't we want to (a) split on passages, i.e. the first part should be 4 samples, (b) split on sections, i.e. the Background should be its own single sample, and (c) exclude subsequent section headers (i.e. the trailing Program history)?

[screenshot attached]
Title: Gaganyaan

Passage: Gaganyaan (; from Sanskrit: , "celestial" and , "craft, vehicle") is an Indian crewed orbital spacecraft intended to be the formative spacecraft of the Indian Human Spaceflight Programme. The spacecraft is being designed to carry three people, and a planned upgraded version will be equipped with rendezvous and docking capabilities. In its maiden crewed mission, the Indian Space Research Organisation (ISRO)'s largely autonomous 5.3-metric ton capsule will orbit the Earth at 400 km altitude for up to seven days with a two- or three-person crew on board. The first crewed mission was originally planned to be launched on ISRO's HLVM3 rocket in December 2021. As of October 2023, it is expected to be launched by 2025.
The Hindustan Aeronautics Limited (HAL)-manufactured crew module underwent its first uncrewed experimental flight on December 18, 2014. design of the crew module has been completed. Defence Research and Development Organisation (DRDO) will provide support for critical human-centric systems and technologies such as space-grade food, crew healthcare, radiation measurement and protection, parachutes for the safe recovery of the crew module, and the fire suppression system.
On June 11, 2020, it was announced that the first uncrewed Gaganyaan launch would be delayed due to the COVID-19 pandemic in India. The overall timeline for crewed launches was expected to remain unaffected. ISRO chairman S. Somanath announced in 2022 that the first crewed mission would not take place until 2024 at the earliest because of safety concerns.
The Gaganyaan Mission will be led by V. R. Lalithambika, the former Director of the Directorate of the Human Spaceflight Programme with ISRO Chairman S Somnath and S. Unnikrishnan Nair, Director of Vikram Sarabhai Space Centre. Imtiaz Ali Khan superseded V. R. Lalithambika as the Director of the Directorate of Human Spaceflight Programme.
Background
In 1984, Rakesh Sharma became the first Indian born citizen to enter space through a joint Interkosmos mission between ISRO and Soviet space program, when he flew aboard the Soviet rocket Soyuz T-11 launched from Baikonur Cosmodrome in the Kazakh Soviet Socialist Republic on 3 April 1984. The Soyuz T-11 spacecraft carrying cosmonauts including Sharma docked and transferred the three member Soviet-Indian international crew, consisting of the ship's commander, Yury Malyshev, and flight engineer, Gennadi Strekalov, to the Salyut 7 Orbital Station. Sharma spent 7days, 21hours, and 40minutes aboard the Salyut 7. He conducted an Earth observation program concentrating on India. He also did life sciences and materials processing experiments, including silicium fusing tests.
To commemorate the occasion special stamps and first day covers were released by the Government of India and Soviet Union.
Program history
orionw commented 1 month ago

Also, some of the Wikipedia passages are really long and seem to go beyond sections, e.g. I got the below corresponding to this Wikipedia page.

Everything that should be really easy to do with Wikipedia is unfortunately surprisingly difficult :( Their main way of providing data dumps is a special markup format (MediaWiki wikitext), which has caused so much pain that the main parser is called mwparserfromhell. Wikipedia doesn't support a parser of its own, so this is the standard package if you need control over the details.

(a) Split on passages i.e. the first part should be 4 samples

This is solvable, depending on what you mean by passages. If you mean the word count, we currently allow up to 500 words; this one is ~426 words. We can definitely re-run and reduce the passage size if you think it's too big. Do you have a size in mind @Muennighoff?

(b) Split on sections i.e. the Background should be its own single sample (c) Exclude subsequent section headers (i.e. the trailing Program history)

These are, unintuitively, actually really hard. It's not easy to tell which line is a section heading vs a list item or just a short sentence: the text comes with newlines only, and sections are often only demarcated by double newlines, which also occur in other places.

I can separate on double newlines or short-sentence heuristics, but it will definitely break on list items and other edge cases and lead to a bunch of other problems. Similarly, for the trailing section header we also have no way of telling if it's a list item, a short sentence, or a heading.

I think this is problematic: people want to vote quickly, not spend a few minutes reading, so the texts need to be short

I agree, but worth noting that this is a trade-off: if we do shorter texts, they won't be attached to their headings in as many cases, since we will have more chunks.

Existing retrieval datasets that use Wiki (DPR, KILT) do 100-word passages, and those aren't connected to the headings unless it is the first heading.


Overall unless we have a Wikipedia parsing expert this is a really hard problem (weeks to months to build IMO and I don't have bandwidth to spend that much time on a parser).

I would suggest we adopt DPR's/KILT's Wikipedia set if we don't want to deal with these issues and are okay with the short context and lack of contextualization; otherwise we will have to deal with the other end of the problem, which is the multiple-sections case we have here. For the trailing section headers, I think we will have that problem in any option we choose.

If we wait another few months, one of my labmates will have a version that maintains the hierarchy with headers for MegaWika v2 and we can adapt it but otherwise we are stuck with many bad heuristic options :/

orionw commented 1 month ago

@Muennighoff feel free to check out the KILT version here: https://huggingface.co/datasets/facebook/kilt_wikipedia

Muennighoff commented 1 month ago

Great explanation - And the current MegaWika is also worse?

Else I see why it makes sense to stick with the existing setup, but maybe let's reduce to 200 words, what do you think? You can try a few samples in the arena if you want. I think they are a bit too long atm. arXiv meanwhile is good length-wise I think. Also cc @isaac-chung @KennethEnevoldsen for opinions.

A few examples:

Title: Airborne fraction

Passage: Discussion about the trend of airborne fraction
Anthropogenic CO2 that is released into the atmosphere is partitioned into three components: approximately 45% remains in the atmosphere (referred to as the airborne fraction), while about 24% and 31% are absorbed by the oceans (ocean sink) and terrestrial biosphere (land sink), respectively. If the airborne fraction increases, this indicates that a smaller amount of the CO2 released by humans is being absorbed by land and ocean sinks, due to factors such as warming oceans or thawing permafrost. As a result, a greater proportion of anthropogenic emissions remains in the atmosphere, thereby accelerating the rate of climate change. This has implications for future projections of atmospheric CO2 levels, which must be adjusted to account for this trend. The question of whether the airborne fraction is rising, remaining steady at approximately 45%, or declining remains a matter of debate. Resolving this question is critical for comprehending the global carbon cycle and has relevance for policymakers and the general public.
The quantity “airborne fraction” is termed by Charles David Keeling in 1973, and studies conducted in the 1970s and 1980s defined airborne fraction from cumulative carbon inventory changes as,
Or,
In which C is atmospheric carbon dioxide, t is time, FF is fossil-fuel emissions and LU is the emission to the atmosphere due to land use change.
At present, studies examining the trends in airborne fraction are producing contradictory outcomes, with emissions linked to land use and land cover change representing the most significant source of uncertainty. Some studies show that there is no statistical evidence of an increasing airborne fraction and calculated airborne fraction as,
Where Gt is growth of atmospheric CO2 concentration, EFF is the fossil-fuel emissions flux, ELUC is the land use change emissions flux.
Another argument was presented that the airborne fraction of CO2 released by human activities, particularly through fossil-fuel emissions, cement production, and land-use changes, is on the rise. Since 1959, the average CO2 airborne fraction has been 0.43, but it has shown an increase of approximately 0.2% per year over that period.
On the other hand, the findings of another group suggest that the CO2 airborne fraction has declined by 0.014 ± 0.010 per decade since 1959. This indicates that the combined land-ocean sink has expanded at a rate that is at least as rapid as anthropogenic emissions. The way they calculated the airborne fraction is:
Where, AF is airborne fraction and SF is sink fraction. ELULCC is the land use and land cover change emissions flux, EFF is the fossil-fuel emissions flux, and SO and SL are the ocean and land sinks, respectively.
The trend analyses of airborne fraction may be affected by external natural occurrences, such as the El Niño-Southern Oscillation (ENSO), volcanic eruptions, and other similar events. It is possible that the methodologies used in these studies to analyze the trend of airborne fraction are not robust, and therefore, the conclusions drawn from them are not warranted.
Title: Federal government of the United States

Passage: Since the American Civil War, the powers of the federal government have generally expanded greatly, although there have been periods since that time of legislative branch dominance (e.g., the decades immediately following the Civil War) or when states' rights proponents have succeeded in limiting federal power through legislative action, executive prerogative or by a constitutional interpretation by the courts.
One of the theoretical pillars of the U.S. Constitution is the idea of "checks and balances" among the powers and responsibilities of the three branches of American government: the executive, the legislative, and the judiciary. For example, while the legislative branch (Congress) has the power to create law, the executive branch under the president can veto any legislation—an act which, in turn, can be overridden by Congress. The president nominates judges to the nation's highest judiciary authority, the Supreme Court (as well as to lower federal courts), but those nominees must be approved by Congress. The Supreme Court, in turn, can invalidate unconstitutional laws passed by the Congress. These and other examples are examined in more detail in the text below.
Legislative branch
The United States Congress, under Article I of the Constitution, is the legislative branch of the federal government. It is bicameral, comprising the House of Representatives and the Senate.
Makeup of Congress
House of Representatives
The U.S. House of Representatives is made up of 435 voting members, each of whom represents a congressional district in a state from where they were elected. Apportionment of seats among the 50 states is determined by state populations, and it is updated after each decennial U.S. Census. Each member serves a two-year term.
In order to be elected as a representative, an individual must be at least 25 years of age, must have been a U.S. citizen for at least seven years, and must live in the state that they represent.
In addition to the 435 voting members, there are six non-voting members, consisting of five delegates and one resident commissioner. There is one delegate each from Washington, D.C., Guam, the Virgin Islands, American Samoa, the Commonwealth of the Northern Mariana Islands, and a resident commissioner from Puerto Rico.
Unlike the U.S. Senate, all members of the U.S. House must be elected and cannot be appointed. In the case of a vacancy, the seat must be filled through a special election, as required under Article 1 of the U.S. Constitution.
Senate
In contrast, the Senate is made up of two senators from each state, regardless of population. There are currently 100 senators (2 from each of the 50 states), who each serve six-year terms. Approximately one-third of the Senate stands for election every two years.
If a vacancy occurs, the state governor appoints a replacement to complete the term or to hold the office until a special election can take place.
Separate powers
Title: The Big Fat Quiz of the Year

Passage: Danny Dyer appeared in the studio to provide a live guest question. Pre-recorded guest questions were provided by Russell Brand, Anchorman 2 stars Steve Carell, Will Ferrell, and Paul Rudd; Olly Murs, Christine Ohuruogu, Louis Walsh, Richard Osman, The Great Gonzo (promoting Muppets Most Wanted), Harry Hill, and Sophie Ellis-Bextor. Educating Yorkshire teachers Mr Mitchell and Mr Burton, The Great British Bake Off series 4 runner-up Ruby Tandoh, Rizzle Kicks, and astronaut Chris Hadfield. The children of Mitchell Brook Primary School returned to act out Edward Snowden's spy leaks. Jon Snow reported on "Wrecking Ball" and Charles Dance read from the autobiography of Lauren Goodger. The mystery guest was Natalie Holt, who threw eggs at Simon Cowell on the final of Britain's Got Talent.
The show was dedicated to comedy agent and producer Addison Cresswell, who died on 22 December 2013.
Jonathan Ross brought most of a turkey, a loaf of bread and champagne. He ended up making sandwiches for the others.
2014
The 2014 edition was recorded on 1 December and aired on 26 December 2014.
Pre-recorded guest questions came from Michael Palin, Tom Daley, the cast of The Inbetweeners, Game of Thrones actress Natalie Dormer, Lily Allen, Rio Ferdinand, Pixie Lott and Status Quo members Francis Rossi and Rick Parfitt. Paralympic gold medallists Kelly Gallagher and Charlotte Evans provided the in-studio guest question. Charles Dance read from the autobiography of Joey Essex. The children of Mitchell Brook Primary School acted out the Bernie Ecclestone trial. Jon Snow gave his news report about "All About That Bass". The mystery guest was Dean Farley, the jogger who ran into David Cameron.
Mel B's performance received notable negative attention on social media and in the press as having brought down the show by being perceived as sour and humorless.
2015
The 2015 edition was recorded on 14 December 2015 and aired on 26 December 2015. The teams did not take names.
Pre-recorded guest questions came from Quentin Tarantino, Rita Ora, Simon Pegg, Will Ferrell and Mark Wahlberg, Josh Groban, Olly Murs, Katie Price and Heston Blumenthal. The Great British Bake Off winner, Nadiya Hussain, provided the in-studio guest question. The children of Mitchell Brook Primary School acted out Jeremy Clarkson's dismissal from Top Gear. Jon Snow reported on Drake's "Hotline Bling". Charles Dance read from List of the Lost, the debut novel by Morrissey. The mystery guest was Cecilia Bleasdale, who took a photo of a black and blue dress which appeared white and gold to some people on the photo, leading to the dress becoming an internet meme. Davies and Ayoade's early answer "Bad Dong" becomes a running joke throughout the episode.
One of the running gags is the panelists deciding to make an alliance against Jimmy by helping Richard & Greg with getting their answers deliberately wrong. Rob even goes far as becoming the host of the program for two minutes however the alliance breaks down after a debate about The Dress.
2016
Title: Gossip Girl

Passage: The success of Gossip Girl led to many adaptations outside the United States. The series received numerous award nominations and won 18 Teen Choice Awards. The CW officially renewed Gossip Girl for a sixth and final season on May 11, 2012. The final season, consisting of 10 episodes, premiered on October 8, 2012, and ended on December 17, 2012.
Premise
The series focuses on a group of privileged teenagers who attend a prestigious high school in the Upper East Side of New York City as their private lives are constantly commented upon by an unknown blogger under the pseudonym "Gossip Girl".
Gossip Girl chronicles the scandals and intimate details of these characters' lives during high school, college, and after. All of their ups and downs are available for the public to read about. Throughout this time, the characters strive to unveil Gossip Girl's true identity.
Episodes
Cast and characters
Main
Blake Lively as Serena van der Woodsen, a student at the Constance Billard School for Girls. She is an it girl who frequently receives media attention.
Leighton Meester as Blair Waldorf, the queen bee of Constance Billard. She is best friends with Serena and highly focused on status, wealth and academic achievement. Her relationship with Chuck is a key theme throughout all six seasons.
Penn Badgley as Dan Humphrey, an outcast student at St. Jude's School for Boys. Dan initially does not fit in with the Upper East Side teenagers as he lives in Brooklyn and is not a legacy student, but rather attends St. Jude's with a partial scholarship. Dan aspires to be a writer.
Chace Crawford as Nate Archibald, a student at St. Jude's, Blair's childhood boyfriend, and the UES golden boy.
Taylor Momsen as Jenny Humphrey (seasons 1–4; guest, season 6), a student at Constance Billard's and Dan's younger sister. Jenny dreams of becoming a fashion designer, and begins as one of Blair's minions in order to gain status. She later rejects the Upper East Side life and becomes rivals with Blair and sleeps with Chuck.
Ed Westwick as Chuck Bass, a student at St. Jude's. He is the son of one of New York's most successful real estate moguls. Decadent and amoral, Chuck is mainly interested in women and alcohol. Once his father dies in the second season, he inherits Bass Industries and becomes a young billionaire. He is romantically involved with Blair throughout the series but they do not start officially dating until the third season. Blair and Chuck's relationship is a key theme throughout all six seasons.
Kelly Rutherford as Lily van der Woodsen (née Rhodes), Serena and Eric's mother and a three-time divorcée. A former photographer, Lily has become one of the UES's most influential socialites. She and Serena often have a strained and rocky relationship.
KennethEnevoldsen commented 1 month ago

Hmm for me it is not the length in the examples but rather the formatting:

E.g. why denote the passage/title explicitly (it should be self-evident from the structure)? E.g.:

Gossip Girl

The success of Gossip Girl led to many ...

Works just fine (for me at least). Additionally, I would also take a look at the oddities:

[...]
Episodes
Cast and characters
Main
[...]

Which seems like a faulty scrape/parsing. This makes it hard to reason about the quality of the retrieval, as the document itself seems to be low quality. I would probably either wait for MegaWika v2 or use KILT.

re length: I probably do agree that the retrieved document is too long as well, but I am unsure whether a better structure would solve it.

Muennighoff commented 1 month ago

Fixed the Connection reset error by just retrying; usually after 1 retry it works fine:

2024-07-18 14:24:01 | ERROR | index_logger | Error in find_neighbors: 503 recvmsg:Connection reset by peer. Retries left: 4
2024-07-18 14:24:28 | INFO | gradio_retrieval | bothbadvote (named). ip:
2024-07-18 15:27:01 | ERROR | stderr | WARNING:  Invalid HTTP request received.
2024-07-18 15:27:24 | INFO | gradio_retrieval | Retrieval. ip:
2024-07-18 15:27:24 | ERROR | index_logger | Error in find_neighbors: 503 recvmsg:Connection reset by peer. Retries left: 4
2024-07-18 15:27:24 | ERROR | index_logger | Error in find_neighbors: 503 recvmsg:Connection reset by peer. Retries left: 4
2024-07-18 15:28:56 | INFO | gradio_retrieval | Retrieval. ip:
2024-07-18 15:28:56 | ERROR | index_logger | Error in find_neighbors: 503 recvmsg:Connection reset by peer. Retries left: 4
2024-07-18 15:30:58 | INFO | gradio_retrieval | Retrieval. ip:
2024-07-18 15:30:58 | INFO | model_logger | Using instruction: Given a query, retrieve a relevant title and passage from Wikipedia
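A minimal sketch of that retry wrapper (ours, not the repo's exact code; ServiceUnavailable is the 503 from the traceback above):

import time
from google.api_core.exceptions import ServiceUnavailable

def with_retry(call, retries: int = 5, backoff_s: float = 1.0):
    # Run call(), retrying on transient 503s from the matching engine.
    for attempt in range(retries):
        try:
            return call()
        except ServiceUnavailable as exc:
            print(f"Error in find_neighbors: {exc}. Retries left: {retries - attempt - 1}")
            time.sleep(backoff_s * (attempt + 1))
    return call()  # final attempt; let the exception propagate

# e.g. docs = with_retry(lambda: endpoint.find_neighbors(find_neighbors_request))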

Also added arXiv random samples, lmk if you have thoughts.

So only Stack Exchange corpus, fixing the result files & some final indices left I think 🙌

Muennighoff commented 1 month ago

I've pushed all changes to main - I've added a new TODO which is fixing the scrolling bug in the below video. Are you getting it too? Would be amazing if someone has bandwidth to take a look, maybe @isaac-chung 🙌

https://github.com/user-attachments/assets/746ae833-caba-42bd-a70e-c19ed557b442

orionw commented 1 month ago

Great explanation - And the current MegaWika is also worse?

@Muennighoff yeah, unfortunately v1 was when we realized that a standard scrape didn't preserve the hierarchical structure of sections. Sam (the first author) has spent a lot of time getting it ready, although also adding a lot of things that are not helpful to us here, like better citation scraping and better multilingual Wikipedia support.

maybe let's reduce to 200 words, what do you think?

I'll start a processing run with 200 words and upload it.
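The idea is just greedy packing of paragraphs up to a word budget; a simplified sketch (the real processing script isn't shown in this thread, and this version skips the heading heuristics discussed above):

def chunk_article(text: str, max_words: int = 200) -> list[str]:
    # Greedily pack newline-separated paragraphs into ~max_words chunks.
    chunks, current = [], []
    for paragraph in text.split("\n"):
        words = paragraph.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks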

I would probably either wait for MegaWika v2 or use KILT

I'll also upload a version of KILT we can look at in the HF viewer for comparison.

Which seems like a faulty scrape/parsing. This makes it hard to reason about the quality of the retrieval as the document itself seems to be low quality.

@KennethEnevoldsen do you mind clarifying? Are you referring to sections that are empty? Those are due to stripping out tables since tables look really weird in text Wikipedia form (they have a fun Wiki table template). I'm assuming we don't want to deal with table reasoning for this project.

orionw commented 1 month ago

Sorry, coming back to the StackExchange now:

So the Dolma one (taken from RedPajamas here) looks like:

{
    "added": "2023-04-23T08:55:11.345Z",
    "attributes": {
        "paloma_paragraphs": []
    },
    "created": "2023-03-29T00:00:00.000Z",
    "id": "d723829ebaa768d11aadf99f01fe9b198950cb1f",
    "metadata": {
        "language": "en",
        "length": 1183,
        "provenance": "stackexchange_00000.jsonl.gz:1",
        "question_score": "16",
        "source": "stackexchange",
        "timestamp": "2023-03-29",
        "url": "https://apple.stackexchange.com/questions/1"
    },
    "source": "redpajama/stackexchange",
    "text": "Q: What is the difference between Intel and PPC? What is the hardware and software differences between Intel and PPC Macs?\n\nA: When it comes to Apple hardware, the differences between the last generation of PowerPC and the first generation of Intel were fairly minor, as far as the end user experience goes. They used the same form factors, and the all-new internals were quite effectively hidden by the unchanged exterior and the accommodations the operating system made for compatibility.\nThe last PowerPC Macs were sold in 2006, so any new machine since then is Intel.\nIn general, Intel Macs can run the vast majority of software created for PowerPC Macs. There is a performance hit for the emulation required, but it runs at acceptable speeds even for complex software like Photoshop. PowerPC Macs cannot run Intel software.\nThe latest version of OS X, Snow Leopard, is available only for Intel-based Macs.\nIntel Macs have access to a feature called Boot Camp, which allows them to boot into Windows at full speed. Intel Macs can also run Windows inside virtual machines with the help of third-party software (VMWare Fusion, VirtualBox or Parallels); there is a minor performance penalty for this, but it's much faster than the emulation required for a PowerPC Mac to run Windows software.\n\nA: The Intel chips at the time of the transition were sourced to be far more thermal and power efficient than the PPC chips of the time. Intel had much more room to grow within the same thermal and physical envelopes in terms of clock rate and the amount of hardware needed to support a given processor choice. \nThe PPC roadmap was shooting for massive clock rates in the 4 to 5 GHz range which amplified these disadvantages for future PPC chips when compared to future Intel chips.\nMoving to Intel processors did away with the need for exotic liquid cooling systems, massive heat sink design and complexity due to space constriants that went into the G5 PowerMac. Power supplies were also downsized.\nPPC design was heading directly into mainframe territory with chipkill memory, CPU virtualization, First Failure Data Capture and other high end / high cost features. Just check out this P5 heat sink and 4 processor MPM with associated L3 cache chips to get a feeling for how massive these processors would grow before Power7 manufacturing finally packed more power in a lower clock rate / smaller package. (and this is finally shipping in 2010). Now the Power5 and Power6 are still shipping and awesome at what they do in server land, just not so appropriate for the current Mac market space.\nFurthermore, there was nothing coming in the pipeline for a portable processor from PPC so even though the power was there for future desktop machines if one accepts the many tradeoffs already listed. Quite simply, portable macs were starving for horsepower on the PPC architecture and likely drove the urgency of a transition to anything but PPC.\n\nA: Hardware-wise: PowerPC is a microprocessor developed mainly by the three developing companies Apple, IBM, and Motorola. It is built with reduced instruction-set computer (RISC) which speeds-up the operation of MIPS (million instructions per second). PowerPC is mainly based on IBM’s earlier Power architecture because it has a similar RISC instruction set for microprocessors.\nIntel and AMD CPU's are based on CISC architectures. Typically CISC chips have a large amount of different and complex instructions. 
The philosophy behind it is that hardware is always faster than software, therefore one should make a powerful instructionset, which provides programmers with assembly instructions to do a lot with short programs.\nIn common CISC chips are relatively slow (compared to RISC chips) per instruction, but use little (less than RISC) instruction\n\nA: PPC Macs refers to the generation of Macintosh computers created in the mid to late 1990s through to 2006 that used PowerPC RISC based chips made by IBM or Motorola. That last PowerPC based Macintosh, the PowerMac G5 stopped being sold in August 2006. The latest version of Mac OS X a PowerPC chip enabled computer was able to run was Mac OS X 10.5 (Leopard) (so long as the computer supported it).\nIntel Macs refers to the newer Macintosh computers (since January 2006) that use Intel's CISC processors. Intel Macs uses EFI instead of BIOS and can run the latest versions of Mac OS X. Intel Macs are also able to run PowerPC compiled applications through a translation layer called Rosetta which is optionally installed in 10.6.\nIf a program is made available as a Universal binary it is able to run on both PPC and Intel Macs however many new applications released today are Intel only (eg. Google Chrome, Final Cut Studio, Mac OS X Snow Leopard).\n\nA: From the end user point of view, you don't need to worry about it much. Many applications were produced as \"universal\", meaning they run on both PPC and Intel-based Macs, and an emulator (called Rosetta) would let PPC-only apps run on the new Intel machines. \nHowever, as time passed, newer features were only available to Intel Macs, so some applications state outright that they require Intel chips. Also, the latest version of Mac OS X only runs on Intel CPUs.\nApple did a reasonably good job of hiding the entire transition from users, so that everything just kept working as people expected, offloading any heavy lifting to software developers.\n\nA: Architecture:\nPowerPC: (short for Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated as PPC) and Intel processor.\nmore information can be found at wikipedia: PowerPC\n\nA: I also wanted to know more on the Power architecture, I did find some good info on it.  I'm glad to share the following information, specially for POWER8 (the latest from IBM):\n\n\n*\n\n*SMT8: 8 threads per core\n\n\n*\n\n*can also switch mode e.g. SMT1, SMT2, SMT4, SMT8 \n\n\n*CAPI: Coherent Accelerator Processor Interface\n\n\n*\n\n*first of its kind in industry\n\n*hardware attachment\n\n*eliminates the Device driver overhead when accessing the FPGA.\n\n*Increased coherency \n\n\n*NUCA - Non Uniform Cache Access\n\n\n*\n\n*though each processor is associated with a L3 cache, NUCA let's the L3 Cache be shared by the cores.\n\n*Benefits data-intensive workloads\n\n\n*NVIDIA partnership:\n\n\n*\n\n*through NVIDIA CUDA parallel computing we can obtain an 8x performance increase for Java programs, on Power8. \n\n\n\nMore references:\n\n\n*\n\n*https://community.runabove.com/kb/en/instances/power8-features.html\n\n*https://www.researchgate.net/publication/273393397_The_cache_and_memory_subsystems_of_the_IBM_POWER8_processor\n\nA: One thing I know is that PPCs are big endian by default, but can switch modes if necessary. Intel are little endian.\n\nA: Power PC has its unique set of instruction in which overall is labeled RISC architecture and the way it performs its program goes way faster than that used on PC. 
About software there isn't difference except the way it was coded or compiled. For example Windows NT 3.51 was developed for PowerPC.\nPC most used processor are labeled CISC architecture which change the way you code and the advantage is operates more than a single task at same time.\nThe term RISC and CISC doesn't make difference since some times RISC 32bits has more complex instructions than CISC 8bits.    \n",
    "version": "v1"
}

Also if we could add the Exchange that each sample is from that'd be great I think. Could put it in the title or sth; would likely improve results a lot.

We can grab the apple stackexchange portion from the url metadata. Where would we put it though? 🤔 At the very beginning, something like "Apple Stackexchange\n\nQ..."? @Muennighoff

We can also definitely just take the first answer (with the assumption that it was the best one); agreed that will be better and keep the text more concise.

Also it seems like there are no titles for the Stack Exchange corpus --- Shouldn't we be able to get the question titles?

What do you mean by this @Muennighoff? Isn't that the "Q: What is the difference between Intel and PPC? What is the hardware and software differences between Intel and PPC Macs?" portion?

Muennighoff commented 1 month ago

We can grab the apple stackexchange portion from the url metadata. Where would we put it though? 🤔 At the very beginning, something like "Apple Stackexchange\n\nQ..."?

Yes I think that'd be good!

We can also definitely just take the first answer (with the assumption that it was the best one); agreed that will be better and keep the text more concise.

Yes that's probably enough information, but maybe let's make it a separate column in the HF dataset so we can still revert to the older one if we want to (without having to recreate the HF dataset).
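A hedged sketch of both transformations on the Dolma-format samples shown above (field names are from that sample; the helper and the new column name are ours):

from urllib.parse import urlparse

def add_short_text(sample: dict) -> dict:
    # Prepend the exchange name parsed from the url, keep the question plus
    # the first answer only, and store it in a new column so the full text survives.
    exchange = urlparse(sample["metadata"]["url"]).hostname.split(".")[0]
    question, sep, rest = sample["text"].partition("\nA: ")
    first_answer = rest.split("\nA: ")[0] if sep else ""
    sample["text_first_answer"] = (
        f"{exchange.title()} Stackexchange\n\n{question.strip()}\n\nA: {first_answer.strip()}"
    )
    return sample

# e.g. with HF datasets: ds = ds.map(add_short_text)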

What do you mean by this @Muennighoff? Isn't that the "Q: What is the difference between Intel and PPC? What is the hardware and software differences between Intel and PPC Macs?" portion?

Oh yes, it seems like they just concatenate the question title and the question text. It may look a bit nicer if we could separate them, but it's also fine as is.

orionw commented 1 month ago

Yes that's probably enough information, but maybe let's make it a separate column in the HF dataset so we can still revert to the older one if we want to (without having to recreate the HF dataset).

How does this stackexchange one look for the text column? It's still really large (14M passages), even filtering out ones with >200 words.

We could filter based on the stackexchange source?

{
    "stackoverflow": 11334721,
    "math": 912634,
    "ru": 297108,
    "superuser": 278189,
    "askubuntu": 239338,
    "serverfault": 148260,
    "tex": 111470,
    "unix": 107464,
    "gis": 84022,
    "physics": 80723,
    "stats": 79967,
    "english": 79519,
    "apple": 75655,
    "magento": 74812,
    "es": 74266,
    "salesforce": 68298,
    "electronics": 65666,
    "pt": 64893,
    "sharepoint": 64793,
    "gaming": 64460,
    "ell": 63984,
    "meta": 61464,
    "mathoverflow": 58416,
    "wordpress": 56990,
    "mathematica": 39308,
    "dba": 32919,
    "scifi": 23196,
    "softwareengineering": 10906
}

For the Wiki, apparently it's a lot slower to process smaller chunks, so the 200-word version of what I had before is almost done; meanwhile, the KILT version looks like this. It has some weird quirks (section and bullet tags like BULLET::::-), but we can easily replace those.

The cons to using this dataset are:

The pros are:

Thoughts on this tradeoff @Muennighoff @KennethEnevoldsen?

I'll post the 200 word version that's from the most recent Wikipedia using our creation method tomorrow when it's done running overnight.

Muennighoff commented 1 month ago

Amazing work!

StackExchange:

Wiki: Let's not go with KILT then, I'd say - it's too much a part of existing models' training data + outdated + badly formatted? Let's see how the 200-word version is! We can also use the existing ones and just cut off after 200 words in the UI or something.

KennethEnevoldsen commented 1 month ago

@KennethEnevoldsen do you mind clarifying? Are you referring to sections that are empty? Those are due to stripping out tables since tables look really weird in text Wikipedia form (they have a fun Wiki table template). I'm assuming we don't want to deal with table reasoning for this project.

Ahh right, yea, those were essentially the problem. I am not sure how to best fix that issue though. I assume the viewer doesn't support tables? Another option would be to just remove the whole section with the table (I am not sure how that influences the content).

Thoughts on this tradeoff @Muennighoff @KennethEnevoldsen?

Hmm, this seems hard to expand upon, so I agree with @Muennighoff: let's not go with KILT.

orionw commented 1 month ago

Let's remove ru, es, pt. Let's also filter based on characters? I found the below sample in the data with <200 words if you filter based on .split(' '), but 40K characters

Done, but unfortunately it's still ~14M passages. Is that too many @Muennighoff @isaac-chung? I can also downsample, perhaps sampling 2M of the 11.3M stackoverflow ones and keeping the rest, for a total of 4-5M. Or another downsampling strategy if someone prefers one.
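A small sketch of filtering on characters as well as whitespace-split words, since .split(' ') alone misses outliers like the 40K-character sample mentioned above (the 2,000-character cap is an arbitrary example value):

def keep_passage(text: str, max_words: int = 200, max_chars: int = 2000) -> bool:
    # Drop passages that are too long by either word or character count.
    return len(text.split(" ")) <= max_words and len(text) <= max_chars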


For Wikipedia here's the new version with 200 words for a total of 16M passages. I did some postprocessing to move short sentences over to increase the probability of headings going to the right place, but it does make some of them larger than 200 words (~250 ish).

I'll create a PR with these changes once we are done iterating.

Muennighoff commented 1 month ago

I can also downsample, perhaps sampling 2M of the 11.3M stackoverflow ones and keeping the rest, for a total of 4-5M. Or another downsampling strategy if someone prefers one.

Sounds good; how about we select the top ~4M across all exchanges according to the question score? Maybe setting a threshold of, say, question score greater than 0 or greater than 5 would be enough? I think this would largely remove samples that would hardly ever get retrieved anyway, as they are probably not interesting to many people or are poor questions. A total of 4-5M sounds good to me; then it would be ~2x as big as Wikipedia/arXiv, which is doable.
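A sketch of that threshold filter, reusing the question_score metadata from the Dolma sample above (the exact threshold is the value to tune):

def above_score_threshold(sample: dict, min_score: int = 5) -> bool:
    # Keep only questions whose score exceeds the threshold.
    return int(sample["metadata"]["question_score"]) > min_score

# e.g. with HF datasets: ds = ds.filter(above_score_threshold)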

For Wikipedia here's the new version with 200 words (https://huggingface.co/datasets/orionweller/wikipedia_0715_200_word)

Looks great. The main downside is that the indices will be twice as expensive but I think it is fine. I will create the indices 👍

Muennighoff commented 1 month ago

Investigated some samples of the new Wikipedia dump - I think it's probably better, but curious if @orionw also thinks so.

Example 1

400 word:

Hercule Poirot (, ) is a fictional Belgian detective created by British writer Agatha Christie. Poirot is one of Christie's most famous and long-running characters, appearing in 33 novels, two plays (Black Coffee and Alibi), and 51 short stories published between 1920 and 1975. Poirot has been portrayed on radio, in film and on television by various actors, including Austin Trevor, John Moffatt, Albert Finney, Peter Ustinov, Ian Holm, Tony Randall, Alfred Molina, Orson Welles, David Suchet, Kenneth Branagh, and John Malkovich. 
Overview 
Influences 
Poirot's name was derived from two other fictional detectives of the time: Marie Belloc Lowndes' Hercule Popeau and Frank Howel Evans' Monsieur Poiret, a retired French police officer living in London. Evans' Jules Poiret "was small and rather heavyset, hardly more than five feet, but moved with his head held high. The most remarkable features of his head were the stiff military moustache. His apparel was neat to perfection, a little quaint and frankly dandified." He was accompanied by Captain Harry Haven, who had returned to London from a Colombian business venture ended by a civil war. A more obvious influence on the early Poirot stories is that of Arthur Conan Doyle. In An Autobiography, Christie states, "I was still writing in the Sherlock Holmes tradition – eccentric detective, stooge assistant, with a Lestrade-type Scotland Yard detective, Inspector Japp". Conan Doyle acknowledged basing his detective stories on the model of Edgar Allan Poe's C. Auguste Dupin and his anonymous narrator, and basing his character Sherlock Holmes on Joseph Bell, who in his use of "ratiocination" prefigured Poirot's reliance on his "little grey cells". Poirot also bears a striking resemblance to A. E. W. Mason's fictional detective Inspector Hanaud of the French Sûreté, who first appeared in the 1910 novel At the Villa Rose and predates the first Poirot novel by 10 years. Christie's Poirot was clearly the result of her early development of the detective in her first book, written in 1916 and published in 1920. The large number of refugees in the country who had fled the German invasion of Belgium in August to November 1914 served as a plausible explanation of why such a skilled detective would be available to solve mysteries at an English country house. At the time of Christie's writing, it was considered patriotic to express sympathy towards the Belgians, since the invasion of their country had constituted Britain's casus belli for entering World War I, and British wartime propaganda emphasised the "Rape of Belgium". Popularity Poirot first appeared in The Mysterious Affair at Styles, published in 1920, and exited in Curtain, published in 1975. Following the latter, Poirot was the only fictional character to receive an obituary on the front page of The New York Times.

200 words:

Hercule Poirot (, ) is a fictional Belgian detective created by British writer Agatha Christie. Poirot is one of Christie's most famous and long-running characters, appearing in 33 novels, two plays (Black Coffee and Alibi), and 51 short stories published between 1920 and 1975. Poirot has been portrayed on radio, in film and on television by various actors, including Austin Trevor, John Moffatt, Albert Finney, Peter Ustinov, Ian Holm, Tony Randall, Alfred Molina, Orson Welles, David Suchet, Kenneth Branagh, and John Malkovich. 

Overview 

Influences 

Poirot's name was derived from two other fictional detectives of the time: Marie Belloc Lowndes' Hercule Popeau and Frank Howel Evans' Monsieur Poiret, a retired French police officer living in London. Evans' Jules Poiret "was small and rather heavyset, hardly more than five feet, but moved with his head held high. The most remarkable features of his head were the stiff military moustache. His apparel was neat to perfection, a little quaint and frankly dandified." He was accompanied by Captain Harry Haven, who had returned to London from a Colombian business venture ended by a civil war.

&

Christie's Poirot was clearly the result of her early development of the detective in her first book, written in 1916 and published in 1920. The large number of refugees in the country who had fled the German invasion of Belgium in August to November 1914 served as a plausible explanation of why such a skilled detective would be available to solve mysteries at an English country house. At the time of Christie's writing, it was considered patriotic to express sympathy towards the Belgians, since the invasion of their country had constituted Britain's casus belli for entering World War I, and British wartime propaganda emphasised the "Rape of Belgium". Popularity Poirot first appeared in The Mysterious Affair at Styles, published in 1920, and exited in Curtain, published in 1975. Following the latter, Poirot was the only fictional character to receive an obituary on the front page of The New York Times.

but the middle paragraph ("A more obvious...") somehow got dropped and is no longer in the new dataset. It is still on the Wikipedia page though (https://en.wikipedia.org/wiki/Hercule_Poirot#:~:text=A%20more%20obvious%20influence%20on,Yard%20detective%2C%20Inspector%20Japp%22.).

Example 2

400 words:

Appearance and proclivities
Captain Arthur Hastings's first description of Poirot: Agatha Christie's initial description of Poirot in Murder on the Orient Express: In the later books, his limp is not mentioned, suggesting it may have been a temporary wartime injury. (In Curtain, Poirot admits he was wounded when he first came to England.) Poirot has green eyes that are repeatedly described as shining "like a cat's" when he is struck by a clever idea, and dark hair, which he dyes later in life. In Curtain, he admits to Hastings that he wears a wig and a false moustache. However, in many of his screen incarnations, he is bald or balding. Frequent mention is made of his patent leather shoes, damage to which is frequently a source of misery for him, but comical for the reader. Poirot's appearance, regarded as fastidious during his early career, later falls hopelessly out of fashion. Among Poirot's most significant personal attributes is the sensitivity of his stomach: He suffers from sea sickness, and, in Death in the Clouds, he states that his air sickness prevents him from being more alert at the time of the murder. Later in his life, we are told: Poirot is extremely punctual and carries a pocket watch almost to the end of his career. He is also particular about his personal finances, preferring to keep a bank balance of 444 pounds, 4 shillings, and 4 pence. Actor David Suchet, who portrayed Poirot on television, said "there's no question he's obsessive-compulsive". Film portrayer Kenneth Branagh said that he "enjoyed finding the sort of obsessive-compulsive" in Poirot. As mentioned in Curtain and The Clocks, he is fond of classical music, particularly Mozart and Bach.

Methods 

In The Mysterious Affair at Styles, Poirot operates as a fairly conventional, clue-based and logical detective; reflected in his vocabulary by two common phrases: his use of "the little grey cells" and "order and method". Hastings is irritated by the fact that Poirot sometimes conceals important details of his plans, as in The Big Four. In this novel, Hastings is kept in the dark throughout the climax. This aspect of Poirot is less evident in the later novels, partly because there is rarely a narrator to mislead. In Murder on the Links, still largely dependent on clues himself, Poirot mocks a rival "bloodhound" detective who focuses on the traditional trail of clues established in detective fiction (e.g., Sherlock Holmes depending on footprints, fingerprints, and cigar ash). From this point on, Poirot establishes his psychological bona fides. Rather than painstakingly examining crime scenes, he enquires into the nature of the victim or the psychology of the murderer. He predicates his actions in the later novels on his underlying assumption that particular crimes are committed by particular types of people.

200 words:

Appearance and proclivities 
Captain Arthur Hastings's first description of Poirot: Agatha Christie's initial description of Poirot in Murder on the Orient Express: In the later books, his limp is not mentioned, suggesting it may have been a temporary wartime injury. (In Curtain, Poirot admits he was wounded when he first came to England.) Poirot has green eyes that are repeatedly described as shining "like a cat's" when he is struck by a clever idea, and dark hair, which he dyes later in life. In Curtain, he admits to Hastings that he has taken to wearing a wig and a false moustache. However, in many of his screen incarnations, he is bald or balding. Frequent mention is made of his patent leather shoes, damage to which is frequently a source of misery for him, but comical for the reader. Poirot's appearance, regarded as fastidious during his early career, later falls hopelessly out of fashion. Among Poirot's most significant personal attributes is the sensitivity of his stomach: He suffers from sea sickness, and, in Death in the Clouds, he states that his air sickness prevents him from being more alert at the time of the murder. Later in his life, we are told:

&

As mentioned in Curtain and The Clocks, he is fond of classical music, particularly Mozart and Bach. 

Methods 

In The Mysterious Affair at Styles, Poirot operates as a fairly conventional, clue-based and logical detective; reflected in his vocabulary by two common phrases: his use of "the little grey cells" and "order and method". Hastings is irritated by the fact that Poirot sometimes conceals important details of his plans, as in The Big Four. In this novel, Hastings is kept in the dark throughout the climax. This aspect of Poirot is less evident in the later novels, partly because there is rarely a narrator to mislead. In Murder on the Links, still largely dependent on clues himself, Poirot mocks a rival "bloodhound" detective who focuses on the traditional trail of clues established in detective fiction (e.g., Sherlock Holmes depending on footprints, fingerprints, and cigar ash). From this point on, Poirot establishes his psychological bona fides. Rather than painstakingly examining crime scenes, he enquires into the nature of the victim or the psychology of the murderer. He predicates his actions in the later novels on his underlying assumption that particular crimes are committed by particular types of people.

Example 3

400 words:

"If I remember rightly – though my memory isn't what it was – you also had a brother called Achille, did you not?" Poirot's mind raced back over the details of Achille Poirot's career. Had all that really happened? "Only for a short space of time," he replied. Poirot is also willing to appear more foreign or vain in an effort to make people underestimate him. He admits as much: It is true that I can speak the exact, the idiomatic English. But, my friend, to speak the broken English is an enormous asset. It leads people to despise you. They say – a foreigner – he can't even speak English properly. ... Also I boast! An Englishman he says often, "A fellow who thinks as much of himself as that cannot be worth much." ... And so, you see, I put people off their guard. He also has a tendency to refer to himself in the third person. In later novels, Christie often uses the word mountebank when characters describe Poirot, showing that he has successfully passed himself off as a charlatan or fraud. Poirot's investigating techniques assist him solving cases; "For in the long run, either through a lie, or through truth, people were bound to give themselves away..." At the end, Poirot usually reveals his description of the sequence of events and his deductions to a room of suspects, often leading to the culprit's apprehension. Life Origins Christie was purposely vague about Poirot's origins, as he is thought to be an elderly man even in the early novels. In An Autobiography, she admitted that she already imagined him to be an old man in 1920. At the time, however, she did not know that she would write works featuring him for decades to come. A brief passage in The Big Four provides original information about Poirot's birth or at least childhood in or near the town of Spa, Belgium: "But we did not go into Spa itself. We left the main road and wound into the leafy fastnesses of the hills, till we reached a little hamlet and an isolated white villa high on the hillside." Christie strongly implies that this "quiet retreat in the Ardennes" near Spa is the location of the Poirot family home. An alternative tradition holds that Poirot was born in the village of Ellezelles (province of Hainaut, Belgium). A few memorials dedicated to Hercule Poirot can be seen in the centre of this village. There appears to be no reference to this in Christie's writings, but the town of Ellezelles cherishes a copy of Poirot's birth certificate in a local memorial 'attesting' Poirot's birth, naming his father and mother as Jules-Louis Poirot and Godelieve Poirot.

200 words:

"If I remember rightly – though my memory isn't what it was – you also had a brother called Achille, did you not?" Poirot's mind raced back over the details of Achille Poirot's career. Had all that really happened? "Only for a short space of time," he replied. Poirot is also willing to appear more foreign or vain in an effort to make people underestimate him. He admits as much: It is true that I can speak the exact, the idiomatic English. But, my friend, to speak the broken English is an enormous asset. It leads people to despise you. They say – a foreigner – he can't even speak English properly. ... Also I boast! An Englishman he says often, "A fellow who thinks as much of himself as that cannot be worth much." ... And so, you see, I put people off their guard. He also has a tendency to refer to himself in the third person.

&

Poirot's investigating techniques assist him solving cases; "For in the long run, either through a lie, or through truth, people were bound to give themselves away..." At the end, Poirot usually reveals his description of the sequence of events and his deductions to a room of suspects, often leading to the culprit's apprehension. Life Origins Christie was purposely vague about Poirot's origins, as he is thought to be an elderly man even in the early novels. In An Autobiography, she admitted that she already imagined him to be an old man in 1920. At the time, however, she did not know that she would write works featuring him for decades to come. A brief passage in The Big Four provides original information about Poirot's birth or at least childhood in or near the town of Spa, Belgium: "But we did not go into Spa itself. We left the main road and wound into the leafy fastnesses of the hills, till we reached a little hamlet and an isolated white villa high on the hillside." Christie strongly implies that this "quiet retreat in the Ardennes" near Spa is the location of the Poirot family home.
Muennighoff commented 1 month ago

I just realized the new Wikipedia dataset has 6.415925213922872x more rows - how come? If it's the same dump, it should at most double in size, no? I know you also took a more recent dump, but Wikipedia doesn't grow 3x in a month. Is it because more passages are included now?

orionw commented 1 month ago

I just realized the new Wikipedia dataset has 6.415925213922872x more rows - how come? If it's the same dump, it should at most double in size, no? I know you also took a more recent dump, but Wikipedia doesn't grow 3x in a month. Is it because more passages are included now?

This is a good question. By default it's at least 2.5x more because of 200 vs 500 words. The rest comes from the packing: it accumulates passages until it reaches 200 words, so if a chunk is at 100 words and the next passage is 150, it doesn't combine them. In practice the corpus therefore ends up larger, since packing makes chunks average less than 200 words. I would have guessed 3-4x beforehand, though, so 6x is a bit weird. But I also don't see any repeats, which is also odd.
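
In code, the packing behaves roughly like this (a sketch of the logic described above, not the actual chunking script):

```python
def pack_passages(passages, target_words=200):
    """Greedily pack consecutive passages into chunks of up to target_words.

    A passage is never split: if adding the next passage would push the
    current chunk past the target, the chunk is flushed first. This is why
    chunks average *below* 200 words (a 100-word chunk won't absorb a
    150-word passage). A single passage longer than the target becomes an
    oversized chunk on its own.
    """
    chunks, current, current_len = [], [], 0
    for p in passages:
        n = len(p.split())
        if current and current_len + n > target_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(p)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = [" ".join(["w"] * n) for n in (120, 100, 150, 60, 90)]
print([len(c.split()) for c in pack_passages(demo)])  # -> [120, 100, 150, 150]
```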

Investigated some samples of the new Wikipedia dump - I think it's probably better, but curious if @orionw also thinks so.

I like the 500-word ones for longer-document search, but I agree with your earlier comment that 200 words is more manageable for humans to read :) I'd go with 200 over 500.

The missing sections are a good catch... I think this has something to do with the packing code, but I'm not sure. Perhaps I dropped sections in the 500-word version that are more often picked up in the 200-word one? I will have to take a closer look later today.
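
A quick way to look (a sketch; the old dump's repo id and the `title`/`text` field names are assumptions, and sentences split across chunk boundaries will false-positive):

```python
from collections import defaultdict
from datasets import load_dataset

new = load_dataset("orionweller/wikipedia_0715_200_word", split="train")
new_by_title = defaultdict(str)
for row in new:
    new_by_title[row["title"]] += " " + row["text"]

old = load_dataset("orionweller/wikipedia_500_word", split="train")  # hypothetical id
for row in old:
    # Long sentences from the old chunks that vanish entirely from the
    # corresponding article in the new dump point to dropped sections.
    for sent in row["text"].split(". "):
        if len(sent.split()) > 15 and sent not in new_by_title.get(row["title"], ""):
            print(f"possibly dropped from {row['title']!r}: {sent[:80]}...")
```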


For StackExchange, at a question score >= 5 we get 1M docs, >= 3 is 2M, >= 2 is 7M, and >= 1 is 10M. I'd say go for >= 3, but happy to go higher/lower! cc @Muennighoff
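
The threshold cut itself is straightforward with `datasets` (a sketch; the repo id and the `question_score` field name are placeholders):

```python
from datasets import load_dataset

ds = load_dataset("orionweller/stackexchange_raw", split="train")  # placeholder id
for thresh in (1, 2, 3, 5):
    subset = ds.filter(lambda x, t=thresh: x["question_score"] >= t)
    print(f">= {thresh}: {len(subset):,} docs")
```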

Muennighoff commented 1 month ago

Wikipedia

Makes sense, let's go with the 200 then. I think there's no obvious way to filter it down, since we don't have popularity metrics or similar - maybe we can afford the 6x increase.

StackExchange

Agreed, let's go with >= 3, so 2M samples in total, right? I.e. similar in size to arXiv.

Muennighoff commented 1 month ago

There's no metadata or similar that we can easily get for the Wikipedia corpus, right? Maybe we could exclude some categories that are likely not interesting to our users.

orionw commented 1 month ago

I think this could then be the final version of the StackExchange corpus: https://huggingface.co/datasets/orionweller/stackexchange_200_words_2000_chars_en_only_3_score/settings

It has only questions with a 3+ score, the single answer, and the subdomain prepended. No document is longer than ~2k chars or 200 words.

Muennighoff commented 1 month ago

I think for arXiv we could probably have done only CS papers to save some costs, but I don't want to re-encode everything right now - maybe later if it becomes expensive.

orionw commented 1 month ago

Okay, here is the final Wikipedia version from 07/15/24 with 3,811,232 instances. It takes the top 500k Wikipedia articles by popularity and chunks them into 200-word passages, but allows passages to be grouped together as long as they don't get too large (to avoid splitting paragraphs when possible and to preserve headings). It's also post-processed to remove any passages with many characters (3k+) but few words.

As an FYI, the top 500k articles account for ~74% of Wikipedia page views (1M is 81%, 750k is 79%, 250k is 64%).

Some stats:

### Chars
Mean: 1445.5656173646737
Median: 1436.0
25th: 1284.0
75th: 1644.0
90th: 1886.0
95th: 2048.0
99th: 2331.0
Max: 2996

### Words
Mean: 233.30970038034945
Median: 230.0
25th: 208.0
75th: 264.0
90th: 304.0
95th: 330.0
99th: 376.0
Max: 455
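
For reference, stats like these take only a few lines with numpy (a sketch; the demo texts stand in for the real corpus):

```python
import numpy as np

# Demo texts standing in for the 3.8M real passages.
texts = ["passage one " * 120, "passage two " * 100, "passage three " * 130]

chars = np.array([len(t) for t in texts])
words = np.array([len(t.split()) for t in texts])
for name, arr in (("Chars", chars), ("Words", words)):
    print(f"### {name}")
    print(f"Mean: {arr.mean()}")
    print(f"Median: {np.median(arr)}")
    for p in (25, 75, 90, 95, 99):
        print(f"{p}th: {np.percentile(arr, p)}")
    print(f"Max: {arr.max()}")
```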
Muennighoff commented 1 month ago

Below are some examples comparing before & after on the Wikipedia index with sentence-transformers/all-MiniLM-L6-v2.

Before:

Screenshot 2024-07-23 at 2 25 52 PM

After:

Screenshot 2024-07-23 at 2 26 07 PM

Before:

Screenshot 2024-07-23 at 2 27 59 PM

After:

Screenshot 2024-07-23 at 2 28 22 PM

Before:

Screenshot 2024-07-23 at 2 31 36 PM

After:

Screenshot 2024-07-23 at 2 31 05 PM

Some long passages are still in there, as seen in the last two images, since you allowed >200 words IIUC, but maybe it's fine.

Muennighoff commented 1 month ago

What random samples would you use for StackExchange, @orionw? I think you suggested CQA previously, but does that make sense given that many of the questions will map one-to-one to documents in the dataset, i.e. BM25 might win on all of them?

orionw commented 1 month ago

Hmm, I haven't seen many CQA comparisons with BM25, so I'm not sure it always wins. If you're worried about overlap, LoTTE could be a good option.

I'm unsure whether people will use the random sample button only for the first example and then manually create their own queries, or whether they will keep cycling through the samples. If they just use the first one and move on to their own, it won't matter too much.

Muennighoff commented 1 month ago

Nice idea on LoTTE - let's take all search queries from the pooled subset, so 6.8K samples?

Also not sure - that will be something we can inspect via the logs after a few days!

Muennighoff commented 1 month ago

Funny BM25 + StackExchange result:

Screenshot 2024-07-26 at 8 57 46 PM Screenshot 2024-07-26 at 8 58 13 PM

Not sure whether users can perform some injection attack on the space? The environment has lots of API keys we don't want users to be able to retrieve, but they shouldn't be able to, since you can't read env variables via markdown alone, right?
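
Either way, escaping corpus text before display would be cheap insurance (a minimal sketch, assuming documents pass through a markdown/HTML component; this is not the arena's actual rendering code):

```python
import html

def render_doc(doc: str) -> str:
    # Escape <, >, &, and quotes so user-controlled corpus text can't
    # inject tags or scripts into whatever component displays it.
    return html.escape(doc)

print(render_doc('<img src=x onerror="alert(1)">'))
# -> &lt;img src=x onerror=&quot;alert(1)&quot;&gt;
```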

Also a nice BM25 + StackExchange result:

Screenshot 2024-07-26 at 8 59 28 PM
KennethEnevoldsen commented 1 month ago

Is it executing arbitrary HTML?

Muennighoff commented 1 month ago

Our Arena (via GritLM) beating StackExchange's site search 😂

Screenshot 2024-07-27 at 7 51 51 AM
