LAION-AI / project-menu

Projects at LAION
MIT License

Compute more stats on laion5b #26

Open rom1504 opened 2 years ago

robvanvolt commented 2 years ago

Items marked (*) are related. If the POS analysis is not possible, then just use the Oxford top word list.

Optional, and not strictly necessary since we provide our own NSFW tagging:

robvanvolt commented 2 years ago

  1. For all authors...

    • [x] Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
    • [x] Did you describe the limitations of your work?
    • [x] Did you discuss any potential negative societal impacts of your work?
    • [x] Have you read the ethics review guidelines and ensured that your paper conforms to them? Ethics Guideline
  2. If you are including theoretical results...

    • [ ] Did you state the full set of assumptions of all theoretical results?
    • [ ] Did you include complete proofs of all theoretical results?
  3. If you ran experiments (e.g. for benchmarks)...

    • [ ] Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?
    • [ ] Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?
    • [ ] Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
    • [ ] Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?
  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

    • [ ] If your work uses existing assets, did you cite the creators?
    • [ ] Did you mention the license of the assets?
    • [ ] Did you include any new assets either in the supplemental material or as a URL?
    • [ ] Did you discuss whether and how consent was obtained from people whose data you’re using/curating?
    • [x] Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?
  5. If you used crowdsourcing or conducted research with human subjects... (not applicable)

    • [x] Did you include the full text of instructions given to participants and screenshots, if applicable?
    • [x] Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
    • [x] Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Zasder3 commented 2 years ago

rom1504 commented 2 years ago

URL Domain frequency (% of total and top-10)

Already did that. It's a very long tail; how can we extract information from that?

Zasder3 commented 2 years ago

If that's the result, that's really all we have to say. It's worth just stating what % of the total some top-k of domains covers, for future users.
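For illustration, here is a minimal sketch of how a "top-k share" number could be computed for a long-tailed domain distribution. It is not the script used on laion5b: the function names `domain_counts` and `top_k_share` and the sample URLs are made up for this example, and at dataset scale the counting would need a distributed job rather than an in-memory `Counter`.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Count how many URLs fall under each hostname (hypothetical helper)."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host:  # skip entries without a parseable hostname
            counts[host] += 1
    return counts


def top_k_share(counts, k):
    """Fraction of all counted URLs covered by the k most frequent domains."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


# Toy data standing in for the real URL column (assumption, not laion5b data).
sample_urls = [
    "https://a.example/img/1.jpg",
    "https://a.example/img/2.jpg",
    "https://a.example/img/3.jpg",
    "https://b.example/x.png",
    "https://b.example/y.png",
    "https://c.example/z.png",
    "https://d.example/q.png",
    "https://e.example/r.png",
]

counts = domain_counts(sample_urls)
for k in (1, 2, 10):
    print(f"top-{k} domains cover {top_k_share(counts, k):.1%} of URLs")
```

Reporting the share for a few values of k (for example 10, 100, 1000) summarizes the long tail in one line each, without having to list the tail itself.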

rom1504 commented 2 years ago

From Cade: "a scan for racial slurs and sexual keywords"
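One way such a scan could look, as a rough sketch only: match captions against a keyword list with a whole-word, case-insensitive regex and report the flagged fraction. The thread does not say how the scan should be done, so the approach, the function names, and the placeholder keyword list below are all assumptions; a real run would substitute a curated list.

```python
import re


def build_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole word."""
    # Longest first so multi-word terms win over their prefixes.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def scan_captions(captions, keywords):
    """Return (flagged captions, flagged fraction) for a keyword list."""
    if not captions or not keywords:
        return [], 0.0
    pattern = build_pattern(keywords)
    flagged = [c for c in captions if pattern.search(c)]
    return flagged, len(flagged) / len(captions)


# Placeholder terms and captions, purely illustrative.
placeholder_keywords = ["badword", "otherterm"]
captions = [
    "A photo of a cat",
    "badword on a sign",
    "Nothing here",
    "An OTHERTERM example",
]

flagged, rate = scan_captions(captions, placeholder_keywords)
print(f"{len(flagged)} of {len(captions)} captions flagged ({rate:.1%})")
```

Whole-word matching avoids flagging innocent words that merely contain a listed term, at the cost of missing deliberate misspellings.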

rom1504 commented 2 years ago

ok summary

what we already have:

what is left:

my opinion:

what I can compute on top: