cpinitiative / usaco-guide

A free collection of curated, high-quality resources to take you from Bronze to Platinum and beyond.
https://usaco.guide

Automate USACO Problem Updates #4027

Closed: thecodingwizard closed this issue 9 months ago

thecodingwizard commented 11 months ago

Any time USACO releases new problems, the following things need to be done (potentially incomplete):

They're done manually right now, but we should figure out how to do it automatically (or at least make it easier to do everything manually).

SansPapyrus683 commented 11 months ago

[image]

danielzsh commented 10 months ago

How would full automation work? Would we have to continuously ping the USACO website to check for new problems/contests?

thecodingwizard commented 10 months ago

For now, a "get problems and update files" script would probably suffice? That way after every contest we only need one person to run a script and update everything.

danielzsh commented 10 months ago

Is access to the Algolia database public? It's separate from this repo right?

thecodingwizard commented 10 months ago

Access to algolia is not public, and is indeed separate from the repository.

However, the format in which we store algolia items is public: https://github.com/cpinitiative/usaco-guide/blob/62f87bb4bcecb2b67722dbc49a443465b1f68544/gatsby-config.ts#L115 It's just the API key that is private.

So if you are interested in working on this, I believe you should be able to test locally by making your own Algolia account and using your own API key!

danielzsh commented 10 months ago

On a semi-related note, is the local Algolia search run during yarn develop synced with the one used by usaco.guide? When I search "balancing a tree" in problems search and click "View Solution" on the actual usaco.guide website, it redirects correctly to the internal sol, but it doesn't work on my local clone (it redirects to the old USACO external sol instead).

[image: "View Solution" links to different pages on usaco.guide vs on a local build]

What's more peculiar is that the "View Solution" link works for most other internal sols locally, just not Balancing a Tree; maybe because the internal sol for this problem was created relatively recently?

thecodingwizard commented 10 months ago

Good observation! I don't think these are synced. I believe there is a "development" version of Algolia (so it's easier to test changes locally without screwing it up in production).

Changes are actually synced when the project is built and deployed (though sometimes this is a little finicky if I recall correctly)

danielzsh commented 10 months ago

Hm, what do you mean by "changes are synced"? Because when I run it locally even on the master branch the Algolia tables still don't match

thecodingwizard commented 10 months ago

There are two indexes in algolia, prod_problems and dev_problems. The actual website runs on prod_problems. Local development uses the dev_problems index (I think). This is probably why running locally yields different Algolia tables than production.

To develop locally, the easiest way is probably to make your own Algolia account and change the API key / index name to use your own Algolia setup. Then you can test changes / modify your own index without interfering with the production website.

When a PR gets merged / Vercel builds the site, the gatsby algolia plugin runs and updates the prod_problems index. That's what I meant by "changes are synced" -- if you modify the algolia schema / want to change an algolia index using the gatsby algolia plugin, the changes are only deployed to the production algolia index when the site is built on Vercel.

bqi343 commented 10 months ago

Then when is dev_problems updated?

thecodingwizard commented 10 months ago

If you have the API keys to CPI's algolia account, you can update dev_problems locally. I forget when exactly it is updated to be honest. I think dev_problems was used a lot when the problems search was being developed, but now it is not really used anymore since there isn't development on problems search.

danielzsh commented 10 months ago

side note: is there any specific reason why problem search is still in beta?

[image]

danielzsh commented 10 months ago

To develop locally, the easiest way is probably to make your own Algolia account and change the API key / index name to use your own Algolia setup. Then you can test changes / modify your own index without interfering with the production website.

will look into it 👍🏻

thecodingwizard commented 10 months ago

I think there were a lot of other features we were thinking of adding to the problems search page, but we just never got around to it…

danielzsh commented 10 months ago

However, the format in which we store algolia items is public:

https://github.com/cpinitiative/usaco-guide/blob/62f87bb4bcecb2b67722dbc49a443465b1f68544/gatsby-config.ts#L115

It's just the API key that is private.

Interesting, so when we run the website locally without any additional configuration, where are these environment variables (ALGOLIA_APP_ID/ALGOLIA_API_KEY) coming from?

danielzsh commented 10 months ago

Also, I've created a .env file with the following content:

ALGOLIA_APP_ID=XXXXXX
ALGOLIA_API_KEY=XXXXXXXXX (Admin API Key)

and within the algolia project I've made an index called dev_problems, but I don't think it's changed anything. Is there another step I missed?

update 1: forgot to run yarn build earlier but am now running into the following error:

  Error: Record at the position 86 objectID=intro-ds is too big size=15157/10000 bytes. Please have a look at https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-your-data/in-depth/index-and-records-size-and-usage-limitations/#record-size-limits

update 2: after setting ALGOLIA_INDEX_NAME to dev and NODE_ENV to development, I now somehow get a new error 😭

  Error: Error loading a result for the page query in "/problems/ccc-firehose/solution". Query was not run and no cached result was found.

update 3: error from update 2 has been fixed in #4084 🙏🏻 still running into this error though

  Error: Record at the position 86 objectID=intro-ds is too big size=15157/10000 bytes. Please have a look at https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-your-data/in-depth/index-and-records-size-and-usage-limitations/#record-size-limits

do i need a paid plan to store the full database?

danielzsh commented 10 months ago

I think there were a lot of other features we were thinking of adding to the problems search page, but we just never got around to it…

Is working on them still a possibility :)

thecodingwizard commented 10 months ago

Oh yikes... we do have a paid plan which is probably why we didn't run into this issue before. But I'm not sure why the object is so big -- maybe we're storing some information we don't actually need to store in that object?

Is working on them still a possibility :)

I personally do not have plans to work on them, but if you happen to have time and want to, that would be much appreciated!!

thecodingwizard commented 10 months ago

Okay, after looking into it, I think it's because the module object in algolia contains the full text content of the module, which is very big for intro DS. Two solutions: either clip the content length to 9k characters (less ideal), or improve the way we extract text from the article (better). For example, quiz questions don't need to be extracted, code does not need to be extracted, etc.
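A minimal sketch of the two options above, assuming the module text is available as a Markdown string before it is sent to Algolia. The function name `extractSearchText`, the regex-based stripping, and the 9000-character default are illustrative assumptions, not the repo's actual extraction code.

```typescript
// Hypothetical helper; the name, the regex, and the 9000 default are
// illustrative assumptions and are not taken from the repo.
const FENCE = '`' + '`' + '`'; // three backticks, built up so this snippet stays fence-safe

function extractSearchText(markdown: string, maxChars = 9000): string {
  // Drop fenced code blocks, since code does not need to be searchable.
  const fencedBlock = new RegExp(FENCE + '[\\s\\S]*?' + FENCE, 'g');
  const withoutCode = markdown.replace(fencedBlock, ' ');
  // Collapse the runs of whitespace left behind by the removal.
  const collapsed = withoutCode.replace(/\s+/g, ' ').trim();
  // Fallback clip. Algolia's limit is in bytes, so a character count is only
  // an approximation for non-ASCII text.
  return collapsed.slice(0, maxChars);
}
```

Stripping quiz questions would depend on how quizzes are marked up in the modules, so that part is left out of the sketch.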

Another approach would be to make every section in every module its own object which might improve search, but this would be harder.
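The per-section idea could look roughly like this. It is only a sketch: it assumes sections are delimited by Markdown `## ` headings, and the `SectionRecord` shape and the `objectID` format are made up for illustration rather than taken from the existing Algolia schema.

```typescript
// Hypothetical record shape; the real index schema lives in gatsby-config.ts.
interface SectionRecord {
  objectID: string; // assumed format: "<moduleId>#<section index>"
  moduleId: string;
  heading: string;
  content: string;
}

// Splits a module's Markdown into one record per "## " heading.
// Text before the first heading is kept under an empty heading.
// Limitation: a line starting with "## " inside a code block is also
// treated as a heading.
function splitIntoSections(moduleId: string, markdown: string): SectionRecord[] {
  const records: SectionRecord[] = [];
  let heading = '';
  let lines: string[] = [];

  const flush = (): void => {
    const content = lines.join('\n').trim();
    if (heading !== '' || content !== '') {
      records.push({
        objectID: `${moduleId}#${records.length}`,
        moduleId,
        heading,
        content,
      });
    }
    lines = [];
  };

  for (const line of markdown.split('\n')) {
    if (line.indexOf('## ') === 0) {
      flush(); // close the previous section before starting a new one
      heading = line.slice(3).trim();
    } else {
      lines.push(line);
    }
  }
  flush();
  return records;
}
```

Each record only carries one section's text, so a long module such as intro-ds would be spread over several smaller objects instead of one oversized one.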

To be honest, we should probably figure out a more ideal way to implement search; right now, the search quality isn't very good I think. The prod_modules index just powers the search for modules functionality on the website.


If you need access to our paid Algolia account and you're a CPI team member, let me know -- we may be able to set something up for you.

danielzsh commented 10 months ago

If you need access to our paid Algolia account and you're a CPI team member, let me know -- we may be able to set something up for you.

I am a team member but it's ok if that's too much of a hassle to set up 😅

To be honest, we should probably figure out a more ideal way to implement search; right now, the search quality isn't very good I think.

Is the metadata in the search results (id:, title:, etc.) intentional? it makes the search results look a bit less professional imo

[image]

Two solutions: either clip the content length to 9k characters (less ideal), or improve the way we extract text from the article (better).

Would definitely be nice to look into, although this probably isn't a priority since I think not many people plan on running their own local algolia clone anyway 😓

I personally do not have plans to work on them, but if you happen to have time and want to, that would be much appreciated!!

I might; is there a list of some of these planned features?

thecodingwizard commented 10 months ago

Is the metadata in the search results (id:, title:, etc.) intentional? it makes the search results look a bit less professional imo

no! it would be nice if we got rid of it.

Would definitely be nice to look into, although this probably isn't a priority since I think not many people plan on running their own local algolia clone anyway 😓

I think improving the way we extract text (ex. by getting rid of the metadata in your screenshot) would help with production search as well, not just local development.

I might; is there a list of some of these planned features?

86 lists some I think, but feel free to use your imagination :D

danielzsh commented 10 months ago

If you need access to our paid Algolia account and you're a CPI team member, let me know -- we may be able to set something up for you.

I am a team member but it's ok if that's too much of a hassle to set up 😅

also, is this still a possibility?

danielzsh commented 10 months ago

Also, I'm trying to use my own Algolia Client like so:

import algoliasearch from 'algoliasearch/lite';

export const searchClient = algoliasearch(
  process.env.ALGOLIA_APP_ID ?? '3CFULMFIDW',
  process.env.ALGOLIA_API_KEY ?? 'b1b046e97b39abe6c905e0ad1df08d9e'
);

(I'm using the ?? so it still works for people who don't have the env variables set) It works when I just directly do:

export const searchClient = algoliasearch(
  'my_app_id',
  'my_api_key'
);

However, the first snippet still defaults to the old values; are the env variables somehow uninitialized when searchClient is initialized?

thecodingwizard commented 10 months ago

also, is this still a possibility?

yes, can you dm me on Discord? @thecodingwizard

are the env variables somehow uninitialized when searchClient is initialized?

How are you setting the environment variables? I think if you put them in an .env file in the root directory it should work, but I could be wrong.

danielzsh commented 10 months ago

I do have them in a .env file, but it's still not working? I tried putting require('dotenv').config() at the top of algoliaSearchClient.ts but I got this error:

BREAKING CHANGE: webpack < 5 used to include polyfills for node.js core modules by default.
This is no longer the case. Verify if you need this module and configure a polyfill for it.

thecodingwizard commented 10 months ago

Hm, sorry, I'm not actually sure what's wrong then...

danielzsh commented 9 months ago

Update: it somehow works now... https://github.com/cpinitiative/usaco-guide/commit/08bd022e1657d2260cfd613098a87a4949056414 in #4086

Although, the RefinementList no longer loads for me locally; is it dependent on the *_modules index?

thecodingwizard commented 9 months ago

You may have needed to restart yarn dev? Not sure.

I think you're correct that RefinementList is dependent on some Algolia configuration (my guess is *_problems). I attached exports of our configuration for *_modules and *_problems here

export-prod_modules-3CFULMFIDW-1702662779.json

export-prod_problems-3CFULMFIDW-1702662884.json

danielzsh commented 9 months ago

Hm, I suspect it's *_modules because I don't have the modules index on my own copy (due to reasons mentioned earlier of the content being too large) and that's likely why the Refinement List doesn't load? (and also the Refinement List categories are basically just the module names)

thecodingwizard commented 9 months ago

you can make a modules index, then populate it by running gatsby build I think!

danielzsh commented 9 months ago

you can make a modules index, then populate it by running gatsby build I think!

Yeah, although, as I don't have a paid account, I unfortunately run into this issue:

Okay, after looking into it, I think it's because the module object in algolia contains the full text content of the module, which is very big for intro DS. Two solutions: either clip the content length to 9k characters (less ideal), or improve the way we extract text from the article (better). For example, quiz questions don't need to be extracted, code does not need to be extracted, etc.

thecodingwizard commented 9 months ago

I would just clip the length for local development purposes for now…

danielzsh commented 9 months ago

edit: nvm, didn't notice the separate div_to_probs file

In https://github.com/cpinitiative/usaco-guide/blob/master/src/components/markdown/ProblemsList/DivisionList/DivisionList.tsx, the code for the monthlies table, the following code appears:

const data = useStaticQuery(graphql`
  query {
    allProblemInfo(
      filter: { source: { in: ["Bronze", "Silver", "Gold", "Plat"] } }
    ) {
      edges {
        node {
          solution {
            kind
            label
            labelTooltip
            sketch
            url
            hasHints
          }
          uniqueId
          url
          tags
          difficulty
          module {
            frontmatter {
              id
            }
          }
        }
      }
    }
  }
`);

When I run this query in graphiQL, problems such as Equal Sum Subarray, which isn't linked to a module, don't appear in the results (however, Piling Papers does, despite being more recent, because it's linked to a module). However, Equal Sum Subarray does show up in the monthlies table itself, which doesn't make a whole lot of sense to me; isn't all the content in the monthlies table extracted from this graphql query?

danielzsh commented 9 months ago

progress: in order to get usaco problems to show up in problems search, we just have to add them to extraProblems.json, so I wrote a script to do that: https://github.com/devo1ution/usaco-guide/blob/algolia/usaco_util.mjs

It prompts for the problem id and generates the corresponding json by querying the problem page and adds it to extraProblems.json.

JSON.stringify() puts all the array elements (tags) on different lines, but this gets fixed by pre-commit :)

updated code to use prettier so this isn't even necessary anymore

TODO:

danielzsh commented 9 months ago

@thecodingwizard not quite sure how this works, but if I queried the usaco website once for every possible problem id (~1500 times) to keep the table up to date would that overload the server 😭

thecodingwizard commented 9 months ago

LOL maybe we can figure out some more efficient solution. Perhaps we can let the user specify which monthly contests they would like to scrape, or we can intelligently scrape from the most recent contest to the oldest contest, stopping whenever we encounter a problem that we have already seen before.

If you're querying solely based off problem ID, maybe you can assume that they are chronologically increasing? I'm not entirely sure...
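The "scrape from newest to oldest and stop at the first known problem" idea could be sketched as below. Everything here is hypothetical: `listProblemIds` stands in for whatever scraping the real script does, contests are assumed to be passed newest first, and no actual USACO endpoint is implied.

```typescript
// Hypothetical sketch: `contests` is ordered newest first, and
// `listProblemIds` stands in for the real scraping step for one contest.
function collectNewProblemIds(
  contests: string[],
  listProblemIds: (contest: string) => number[],
  seen: number[]
): number[] {
  const fresh: number[] = [];
  for (const contest of contests) {
    for (const id of listProblemIds(contest)) {
      if (seen.indexOf(id) !== -1) {
        // Everything older than this is assumed to be recorded already,
        // so no further contests are fetched.
        return fresh;
      }
      fresh.push(id);
    }
  }
  return fresh;
}
```

The early return is what keeps the number of requests small: after a contest, only the newest contest and the first already-known one are fetched.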

Thanks for all your work here! This will be super helpful for the upcoming contest :)

danielzsh commented 9 months ago

Just wrote a script that added all old problems to extraProblems.json if they weren't already there or in a module: https://github.com/cpinitiative/usaco-guide/pull/4086/commits/7f4baecbdf39585f401eafed8a9ce9354cd8c8a0 (although the difficulties will need to be manually tweaked)

i also pushed these changes to dev_problems in case you want to mess around with them

also edited usaco_util.mjs to prompt for difficulty too

Although for whatever reason my code fails tsc status check now and I'm not sure why?
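For illustration, the "add it only if it isn't already there or in a module" step might look like the following. This is not the linked commit's code: the `ProblemEntry` shape is a guess (only `uniqueId` is borrowed from the GraphQL query earlier in the thread), and the real extraProblems.json schema may differ.

```typescript
// Hypothetical entry shape; the real extraProblems.json schema may differ.
interface ProblemEntry {
  uniqueId: string;
  name: string;
  difficulty: string;
}

// Returns `extra` plus every scraped problem whose uniqueId is neither in
// `extra` already nor used by a module. The inputs are not mutated.
function addMissingProblems(
  extra: ProblemEntry[],
  moduleProblemIds: string[],
  scraped: ProblemEntry[]
): ProblemEntry[] {
  const known = extra.map((p) => p.uniqueId).concat(moduleProblemIds);
  const merged = extra.slice();
  for (const p of scraped) {
    if (known.indexOf(p.uniqueId) === -1) {
      known.push(p.uniqueId); // also de-duplicates within `scraped`
      merged.push(p);
    }
  }
  return merged;
}
```

The linear `indexOf` lookup is fine at this scale (on the order of 1500 problems); a `Set` would be the natural choice for anything larger.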

danielzsh commented 9 months ago

also minor change: I renamed ALGOLIA_APP_ID to GATSBY_ALGOLIA_APP_ID so it can be accessed in algoliaSearchClient; although I have provided default values in algoliaSearchClient so it shouldn't make too much of a difference.

thecodingwizard commented 9 months ago

(although the difficulties will need to be manually tweaked)

I wonder if it's possible to create a new difficulty value of "Unknown". might need to tweak a lot of the UI rendering stuff too though..

SansPapyrus683 commented 9 months ago

@thecodingwizard not quite sure how this works, but if I queried the usaco website once for every possible problem id (~1500 times) to keep the table up to date would that overload the server 😭

i mean just query the last couple? you can set a floor and query up from that every so often

danielzsh commented 9 months ago

@thecodingwizard side note: would it be possible to have pull request branches push to dev_problems instead of prod_problems and also use dev_problems? That way algolia updates are easier to preview

i mean just query the last couple? you can set a floor and query up from that every so often

yeah I think we can keep the latest problem id as a repo secret and then we can set up timed workflows like for each season (December/Jan/Feb/March 20th) so we can just incrementally update automatically
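A rough sketch of the "set a floor and query up from that" approach, under the assumption (not confirmed in this thread) that problem ids increase over time, possibly with gaps. `problemExists` is a hypothetical stand-in for a request to the problem page, and the miss and probe limits are arbitrary.

```typescript
// Hypothetical sketch: `problemExists` stands in for a request to the problem
// page; a real version would be async and should pause between requests.
function probeNewIds(
  lastKnownId: number,
  problemExists: (id: number) => boolean,
  maxConsecutiveMisses = 5,
  maxProbes = 100
): number[] {
  const found: number[] = [];
  let misses = 0;
  let probes = 0;
  // Walk upward from the stored floor until several ids in a row are missing
  // (or the probe budget runs out, which guards against endless loops).
  for (let id = lastKnownId + 1; misses < maxConsecutiveMisses && probes < maxProbes; id++) {
    probes++;
    if (problemExists(id)) {
      found.push(id);
      misses = 0;
    } else {
      misses++;
    }
  }
  return found;
}
```

After a run, the largest id in the result (or the old floor, if nothing was found) would become the new stored floor for the next scheduled workflow.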

I wonder if it's possible to create a new difficulty value of "Unknown". might need to tweak a lot of the UI rendering stuff too though..

alr I added a new N/A difficulty class that shows a tooltip when you hover over it: This problem was added automatically; if you want to suggest a difficulty, feel free to make a pull request!

danielzsh commented 9 months ago

Also, I refactored the difficulty box (the little thing that says Easy/Hard/Insane) into a separate file but am now running into this warning:

warn chunk commons [mini-css-extract-plugin]
Conflicting order. Following module has been added:
 * css ./node_modules/gatsby/node_modules/css-loader/dist/cjs.js??ruleSet[1].rules[9].oneOf[1].use[1]!./node_modules/postcss-loader/dist/cjs.js??ruleSet[1].rules[9].oneOf[1].use[2]!./node_modules/tippy.js/themes/material.css
despite it was not able to fulfill desired ordering with these modules:
 * css ./node_modules/gatsby/node_modules/css-loader/dist/cjs.js??ruleSet[1].rules[9].oneOf[1].use[1]!./node_modules/postcss-loader/dist/cjs.js??ruleSet[1].rules[9].oneOf[1].use[2]!./node_modules/tippy.js/themes/light.css
   - couldn't fulfill desired order of chunk group(s) component---src-pages-problems-tsx, component---src-pages-problems-tsxhead
   - while fulfilling desired order of chunk group(s) component---src-pages-dashboard-tsx, component---src-templates-module-template-tsx, component---src-templates-solution-template-tsx,

Any idea how to resolve it?

edit: nvm, just had to alphabetically reorder the imports

thecodingwizard commented 9 months ago

side note: would it be possible to have pull request branches push to dev_problems instead of prod_problems and also use dev_problems? That way algolia updates are easier to preview

It seems like this should be possible, but I am not sure how. I think in Vercel you can set environment variables dependent on whether it is production or preview, so perhaps we can add an environment variable that specifies the algolia prefix to use. (I think we might already have an environment variable named ALGOLIA_INDEX_NAME)

danielzsh commented 9 months ago

It seems like this should be possible, but I am not sure how. I think in Vercel you can set environment variables dependent on whether it is production or preview, so perhaps we can add an environment variable that specifies the algolia prefix to use. (I think we might already have an environment variable named ALGOLIA_INDEX_NAME)

Could you try setting ALGOLIA_INDEX_NAME to dev? Idt I have access to the Vercel 😅

thecodingwizard commented 9 months ago

Done! Vercel needs to re-build before the changes will take effect (I triggered a manual rebuild for your Algolia PR).

danielzsh commented 9 months ago

Hm, it seems like the most recent vercel build is still using prod_problems: https://usaco-guide-ci2palbvi-cpinitiative.vercel.app/problems/ (e.g. try searching "fertilizing pastures"; it's present in dev_problems but doesn't show up here)

Or am I misunderstanding what rebuild means 😅

edit: oops, turns out the index doesn't depend on ALGOLIA_INDEX_NAME! In problems.tsx:

const indexName =
  process.env.NODE_ENV === 'production' ? 'prod_problems' : 'dev_problems';

can I change this to depend on ALGOLIA_INDEX_NAME instead? e.g.

const indexName = `${process.env.ALGOLIA_INDEX_NAME}_problems`;

edit 2: we also have to rename ALGOLIA_INDEX_NAME to GATSBY_ALGOLIA_INDEX_NAME so components can actually access it 😓

thecodingwizard commented 9 months ago

can I change this to depend on ALGOLIA_INDEX_NAME instead? e.g.

Yes, that makes sense! Though, maybe as a fallback (ie. if process.env.ALGOLIA_INDEX_NAME is not defined), default to what we had previously?

we also have to rename ALGOLIA_INDEX_NAME to GATSBY_ALGOLIA_INDEX_NAME so components can actually access it

oops sorry, why is this the case? (like why do we need the GATSBY_ prefix / what was the reasoning for renaming ALGOLIA_INDEX_NAME to GATSBY_ALGOLIA_INDEX_NAME?)
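The fallback suggested in this comment could be sketched as follows. The helper name `resolveProblemsIndex` and the `Env` type are illustrative; the environment is passed in as a plain object only so the logic can be exercised on its own, whereas a component would more likely read `process.env.GATSBY_ALGOLIA_INDEX_NAME` directly.

```typescript
// Sketch only: `env` is a plain object here so the logic is easy to test.
type Env = Record<string, string | undefined>;

function resolveProblemsIndex(env: Env): string {
  const prefix = env.GATSBY_ALGOLIA_INDEX_NAME;
  if (prefix) {
    // e.g. "dev" -> "dev_problems"
    return `${prefix}_problems`;
  }
  // Fallback: the NODE_ENV check that problems.tsx used previously.
  return env.NODE_ENV === 'production' ? 'prod_problems' : 'dev_problems';
}
```

An empty or unset variable falls through to the old behaviour, so builds without the new variable keep working as before.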

danielzsh commented 9 months ago

I have the index defaulted to dev_problems for now, but I can change that to prod if you want!

As for the env variables, it's a bit obscured in the docs, but env variables without the GATSBY_ prefix can only be accessed in gatsby-config.ts (which is why my env variables weren't working properly earlier I think).


thecodingwizard commented 9 months ago

Oh wow, I had no idea. I added GATSBY_ALGOLIA_INDEX_NAME to both prod and dev, and triggered a rebuild for your branch.

Defaulting to dev_problems seems fine to me!

bqi343 commented 9 months ago

can the Dec 2023 problems be added?