JSON format for now and future

IntegersOfK commented 7 years ago

The JSON format is good but I'm thinking we don't want to have to keep rejigging our dataset for other use-cases as it starts to grow. I have a few thoughts.

We could switch the concept of "answers" with the word "responses" because not all the responses are the answer(s). It's just that the word 'answer' implies that it is the correct response.

I know it's hard to come up with a "difficulty" value for questions, but I don't hate the idea of a question difficulty rating so I'm throwing it out there.

I also wonder if it might work better as "correct_responses":[] and "false_responses":[], two independent lists, because then developers could provide more than one correct answer if that's their use-case, i.e. "select all the correct responses" instead of "select the single correct answer".

I think if developers are using this dataset in their app, they will treat the entire chunk of JSON for a single question and its responses as one object: the "Question.class", or whatever object they're initialising, will probably itself be a "Question" model. For this reason, the question text should be called "title" or some other distinguishing word. Plus, a prompt like "Select the tallest mountain" isn't technically a question, it's just a statement.

So I guess I'm proposing something like this:

{
  "category_id": "Nature",
  "source": "http://example.com",
  "submitted": 1494700927,
  "difficulty": "1",
  "en": {
    "title": "Select all the animals",
    "correct": [
      "Cat",
      "Cow",
      "Chicken"
    ],
    "false": [
      "Car",
      "Bus",
      "Train"
    ]
  },
  "fr": {
    "title": "Sélectionnez tous les animaux.",
    "correct": [
      "Chat",
      "Vache",
      "Poulet"
    ],
    "false": [
      "Auto",
      "Bus",
      "Train"
    ]
  }
}
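To make the "select all the correct responses" use-case concrete, here is a minimal Python sketch of how an app might grade a selection against the proposed "correct" and "false" lists. The function names `grade` and `all_choices` are made up for illustration; nothing like them exists in the project.

```python
def grade(question, lang, selected):
    """Return True only if `selected` is exactly the set of correct responses."""
    localized = question[lang]
    return set(selected) == set(localized["correct"])


def all_choices(question, lang):
    """Correct and false responses combined, e.g. to display them (unshuffled)."""
    localized = question[lang]
    return list(localized["correct"]) + list(localized["false"])


# Same shape as the proposal above, trimmed to one language.
question = {
    "category_id": "Nature",
    "en": {
        "title": "Select all the animals",
        "correct": ["Cat", "Cow", "Chicken"],
        "false": ["Car", "Bus", "Train"],
    },
}

print(grade(question, "en", ["Cow", "Cat", "Chicken"]))  # order does not matter
print(grade(question, "en", ["Cat", "Car"]))
```

A single-answer question is then just the special case where "correct" has one entry.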

Finally, I realise we're only starting here, but I think we should always anticipate that we will have many languages and that the dataset may grow to include another one at any time. On my international projects, I notice that after a while the top level fills up with languages and I waste a lot of time cycling through top-level keys. For this reason, I would suggest nesting the language-specific content under another dictionary.

{
  "category_id": "Nature",
  "source": "http://example.com",
  "submitted": 1494700927,
  "difficulty": "1",
  "languages": {
    "en": {
      "title": "Select all the animals",
      "correct": [
        "Cat",
        "Cow",
        "Chicken"
      ],
      "false": [
        "Car",
        "Bus",
        "Train"
      ]
    },
    "fr": {
      "title": "Sélectionnez tous les animaux.",
      "correct": [
        "Chat",
        "Vache",
        "Poulet"
      ],
      "false": [
        "Auto",
        "Bus",
        "Train"
      ]
    }
  }
}
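With the nested layout, finding the available languages no longer means filtering metadata keys out of the top level. A rough Python sketch, assuming the "languages" key proposed above; `available_languages`, `localized` and the English fallback are illustrative choices, not part of the proposal.

```python
def available_languages(question):
    """Language codes present for this question, in a stable order."""
    return sorted(question["languages"])


def localized(question, lang, fallback="en"):
    """Content for `lang`, or for `fallback` if missing (None if neither exists)."""
    languages = question["languages"]
    return languages.get(lang, languages.get(fallback))


question = {
    "category_id": "Nature",
    "difficulty": "1",
    "languages": {
        "en": {"title": "Select all the animals"},
        "fr": {"title": "Sélectionnez tous les animaux."},
    },
}

print(available_languages(question))
print(localized(question, "de")["title"])  # no German entry, falls back to English
```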

IntegersOfK commented 7 years ago

Also I think we should include a schema version number.

(And possibly a checksum? But I can't quite figure out what we would want to sum and check.)
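One possible answer to "what would we sum and check", purely as a sketch: hash a canonical serialisation of the question itself, leaving out the checksum field. The "checksum" field name and the choice of SHA-256 are assumptions here, not something the schema defines.

```python
import hashlib
import json


def question_checksum(question):
    """SHA-256 over the question content, ignoring any existing checksum field."""
    content = {k: v for k, v in question.items() if k != "checksum"}
    # sort_keys and fixed separators make the serialisation independent of
    # key order and whitespace in the source file.
    canonical = json.dumps(
        content, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_intact(question):
    """True if the stored checksum matches the current content."""
    return question.get("checksum") == question_checksum(question)


question = {"title": "Select all the animals", "correct": ["Cat", "Cow"]}
question["checksum"] = question_checksum(question)
intact_before = is_intact(question)

question["correct"].append("Car")  # simulate an accidental edit
intact_after = is_intact(question)
print(intact_before, intact_after)
```

This would only catch accidental corruption or unreviewed edits; it says nothing about whether the question is actually right.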

mtancoigne commented 7 years ago

Hi! I like the multiple-response possibility; it can really improve things. For the "translations", I really think we should put them somewhere else, to create more appropriate sets.

I'd also add a "tags" list instead of a category_id. Maybe we can keep a main category, but tags will allow fetching relevant subsets.

  {
    "id": "one-kind-of-uuid", // maybe a timestamp too
    "category_id": "Nature",  // main "tag" or category
    "source": "http://example.com",
    "updated": 1494700927, // Update date is more relevant
    "difficulty": "1", // Yup, hard to define...
    "question": "Select all the animals", // Maybe markdown here ?
    "correct": [
      "Cat",
      "Cow",
      "Chicken"
    ],
    "propositions": [
      "Car",
      "Bus",
      "Train"
    ],
    "tags": ["animals"]
  }
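As a small illustration of fetching subsets by tag, here is a Python sketch; `with_tags` is a hypothetical helper and the sample records are invented.

```python
def with_tags(questions, wanted):
    """Questions carrying every tag in `wanted`; a missing tags list means no tags."""
    wanted = set(wanted)
    return [q for q in questions if wanted <= set(q.get("tags", []))]


questions = [
    {"id": "q-001", "category_id": "Nature", "tags": ["animals"]},
    {"id": "q-002", "category_id": "Nature", "tags": ["animals", "farm"]},
    {"id": "q-003", "category_id": "Nature", "tags": ["mountains"]},
]

print([q["id"] for q in with_tags(questions, ["animals"])])
print([q["id"] for q in with_tags(questions, ["animals", "farm"])])
```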

If you're OK with that, I'm going to convert all the existing files to this format.

Oh, and as we're dealing with about 50 000 questions here, we really need to find a good way to organise the data; editing files with more than 500 different questions each will be really boring and error-prone. Do you know NPM a bit? Maybe we can create a quick tool to manage single files...
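A sketch of what such a tool could do, written in Python rather than as an NPM package: split one large file into one file per question. It assumes the large file is a JSON array and that each "id" is safe to use as a file name; `split_questions` and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def split_questions(source_file, target_dir):
    """Write each question of a JSON array to <target_dir>/<id>.json."""
    questions = json.loads(Path(source_file).read_text(encoding="utf-8"))
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    names = []
    for question in questions:
        path = target / f"{question['id']}.json"
        text = json.dumps(question, indent=2, ensure_ascii=False) + "\n"
        path.write_text(text, encoding="utf-8")
        names.append(path.name)
    return sorted(names)


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "nature.json"
    sample = [
        {"id": "q-001", "question": "Select all the animals"},
        {"id": "q-002", "question": "Select the tallest mountain"},
    ]
    source.write_text(json.dumps(sample), encoding="utf-8")
    names = split_questions(source, Path(tmp) / "questions")
    first = json.loads(
        (Path(tmp) / "questions" / "q-001.json").read_text(encoding="utf-8")
    )

print(names)
```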

IntegersOfK commented 7 years ago

Ok that's a great synthesis of my thoughts, reconciled with your concerns too. I'd say we have a schema!

And after more reflection, it makes sense to keep languages separate, because users typically only want to play in (or use) one language, and these are filesets for people to download (as opposed to a database). For that same reason (fileset), I still think we need a field for meta properties, or simply a schema version, in case we (in the future) or people with their apps (today) want to update or expand the schema.

It's an interesting issue to me because with an API you can deprecate an endpoint but still allow access, or otherwise keep some overlap with your older schema/format until all the developers have had a chance to update to the new API. But in this case we've got these files, and eventually there could be so many, broken down into folders by category and so on, that it might not be realistic to expect a developer to have updated their app's schema. Maybe I'm just making a problem where there isn't one (yet)?

But picture this case: a developer makes a trivia app with questions about his hometown, and it catches on, so he decides to expand his database of questions. He notices our open-source fileset and uses some questions from a few of our categories. He adopts our schema for his own personally-sourced questions about his hometown. Later, he comes to our site and notices that we have updated our schema with new properties. No problem, he can add a new module to deal with the new schema properties. The downside is that he must now convert his own data to our updated schema, or otherwise do a bunch of extra data management, which we might have been able to save him just by including a schema version number. Anyway, in the end maybe we can defer that by adding it in version 2 (if we ever even feel the need to change the schema).
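To show what a version number would buy that developer, here is a rough Python sketch of version-based upgrading. Everything in it is hypothetical: there is no version 2, and the "difficulty becomes an integer" change is invented only to have something to migrate.

```python
CURRENT_VERSION = 2  # hypothetical; only one schema exists so far


def upgrade_1_to_2(record):
    """Invented change for illustration: store difficulty as an integer."""
    upgraded = dict(record)
    upgraded["difficulty"] = int(record["difficulty"])
    upgraded["schema_version"] = 2
    return upgraded


# One upgrade step per version; newer steps would be appended here.
MIGRATIONS = {1: upgrade_1_to_2}


def upgrade(record):
    """Apply upgrade steps until the record is at CURRENT_VERSION."""
    # Assumption: files written before the field existed count as version 1.
    version = record.get("schema_version", 1)
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version = record["schema_version"]
    return record


old = {"question": "Select all the animals", "difficulty": "1"}
print(upgrade(old))
```

Without the field, the app has to guess the format from which keys happen to be present.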

I'm not scared to dive into some sort of mini-framework for question management (and the gathering of new questions could be integrated eventually as well, instead of using Google Forms). At some point I think we're going to have to do a lot of verifying of questions, cleaning up, translation, etc... so we will want to make a "queue" and some sort of "Yes, I have vetted this question and it looks good" situation. Not to mention a process for regenerating the actual fileset after we make said changes.

mtancoigne commented 7 years ago

So maybe we can add a schema_version attribute, that makes sense. Even if the schema never evolves, at least the attribute will be there if we change our minds.

About a backend to manage/edit/(vote on?) the questions: I don't know how you want to do it, but it's a good idea. I was thinking of a "standard" MySQL (or whatever) database to store the questions and all the related data (votes, submitters, etc.), with a JSON export module that exports only the data we need here in the repo, so we can update the repo from time to time by exporting files from the backend.

Technically speaking, I'm good with PHP/JS; I tried SailsJS as an API server, coupled with VueJS for the client. It works well for small apps like what we're planning to do.

I propose an app with the following capabilities:

Anonymous submitting/browsing
Registered side with:
- Votes
- Corrections
- Validation/rejection (maybe available with a certain number of votes, or manually by some admins...)

Validated questions would be included in the GitHub repo.
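The validate-then-export flow could look roughly like this in Python. The vote threshold, the "votes", "admin_approved" and "submitted_by" fields, and the function names are all assumptions made for the sketch; the backend does not exist yet.

```python
import json

VOTE_THRESHOLD = 5  # hypothetical number of votes needed without an admin

# Only these fields leave the backend; votes, submitters, etc. stay behind.
EXPORTED_FIELDS = (
    "schema_version", "id", "category_id", "source", "updated",
    "difficulty", "question", "correct", "propositions", "tags",
)


def is_validated(submission):
    """Validated by an admin, or by reaching the vote threshold."""
    approved = bool(submission.get("admin_approved"))
    return approved or submission.get("votes", 0) >= VOTE_THRESHOLD


def export_validated(submissions):
    """JSON text containing only validated questions and only exported fields."""
    records = [
        {field: s[field] for field in EXPORTED_FIELDS if field in s}
        for s in submissions
        if is_validated(s)
    ]
    return json.dumps(records, indent=2, ensure_ascii=False)


submissions = [
    {"id": "q-001", "question": "Select all the animals", "votes": 7,
     "submitted_by": "anonymous"},
    {"id": "q-002", "question": "Select the tallest mountain", "votes": 1},
    {"id": "q-003", "question": "Select all the vehicles", "votes": 0,
     "admin_approved": True},
]

exported = json.loads(export_validated(submissions))
print([record["id"] for record in exported])
```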

Last Schema proposition, to be sure:

  {
    "schema_version":1,
    "id": "one-kind-of-uuid", // maybe a timestamp too
    "category_id": "Nature",  // main "tag" or category
    "source": "http://example.com",
    "updated": 1494700927, // Update date is more relevant
    "difficulty": "1", // Yup, hard to define...
    "question": "Select all the animals", // Maybe markdown here ?
    "correct": [
      "Cat",
      "Cow",
      "Chicken"
    ],
    "propositions": [
      "Car",
      "Bus",
      "Train"
    ],
    "tags": ["animals"]
  }
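Before converting the existing files, a quick check like the following Python sketch could catch records that do not match this proposal. It assumes every field is required and uses the types from the example (so "difficulty" is a string); neither point has been decided in this thread.

```python
# Field names from the proposal above; the types mirror the example values.
REQUIRED = {
    "schema_version": int,
    "id": str,
    "category_id": str,
    "source": str,
    "updated": int,
    "difficulty": str,
    "question": str,
    "correct": list,
    "propositions": list,
    "tags": list,
}


def validate(record):
    """Return a list of problems; an empty list means the record looks fine."""
    errors = []
    for key, expected in REQUIRED.items():
        if key not in record:
            errors.append(f"missing field: {key}")
        elif not isinstance(record[key], expected):
            errors.append(f"{key} should be {expected.__name__}")
    if isinstance(record.get("correct"), list) and not record["correct"]:
        errors.append("at least one correct response is required")
    return errors


record = {
    "schema_version": 1,
    "id": "one-kind-of-uuid",
    "category_id": "Nature",
    "source": "http://example.com",
    "updated": 1494700927,
    "difficulty": "1",
    "question": "Select all the animals",
    "correct": ["Cat", "Cow", "Chicken"],
    "propositions": ["Car", "Bus", "Train"],
    "tags": ["animals"],
}

print(validate(record))
```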

IntegersOfK commented 7 years ago

+1 for the schema, thank you for humouring my schema_version anxiety.

I have the most experience with AngularJS, but I don't really feel secure enough to dictate the "best" architecture. I guess we just kind of have to start with something. I don't think the future of the web is PHP, and if we intend to recruit project members we probably have to suck up the learning curve or figure out the best thing to let us iterate/prototype.

I use Python on Google App Engine for pretty much everything on my projects so far, and Cloud Storage seems like an appropriate candidate for these filesets. But then again, GitHub is perfectly capable of storing text, and that almost suits its nature. Maybe version one is literally just pull requests or issues tagged with "trivia" or something, and we could make a website with a forum which submits to the GitHub API. I don't know, it's just a thought; maybe that's too complex.

mtancoigne commented 7 years ago

I think the GitHub repo should only keep clean records, so we should have a transitional "state": a platform for submission and review. The repo would then be updated with these records from time to time.

About the dev stack, PHP isn't dead yet :) but I don't really care about the language as long as it works and it's easy to set up. I don't know Go, and I have only a very basic knowledge of Python, but it's OK for me if you want to use it.

mtancoigne commented 7 years ago

By the way, I won't update the files for now; I think we should have the structure in place before doing anything... If we just "start with something", I propose a simple CakePHP website: it can be set up in a day. I'll try it out and put it on a VPS for testing during this week.

mtancoigne commented 7 years ago

Hi!

I'm coming back after these few months with some questions. I'm starting to design the database for the website, and I don't know whether we should store the propositions in a separate table or in the question itself:

[image: selection_091]

IntegersOfK commented 7 years ago

I don't know the correct answer but I will offer thoughts. I think it's fair to wait before updating the files.

Do we need more language handling here? I just want to make sure the same question in various languages can be connected.

About your question, I don't see any reason propositions couldn't be in the question itself, because they can't really be re-used. Can you think of a case where somebody might want to use the same propositions for different questions? On the other hand, speaking to my language concern above, maybe it makes sense to keep the separate tables and move the language CHAR(3) column from questions into the answers and propositions tables, so a language can be selected there, while adding another table for the question text itself.
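A sketch of that second option using Python's built-in sqlite3 module, just to make the idea testable. The table and column names are guesses (the real design is only in the screenshot), and SQLite stands in for whatever database ends up being used.

```python
import sqlite3

# Hypothetical layout: language lives on the text, answers and propositions.
SCHEMA = """
CREATE TABLE questions (id INTEGER PRIMARY KEY, category_id TEXT NOT NULL);
CREATE TABLE question_texts (
    question_id INTEGER NOT NULL REFERENCES questions(id),
    language CHAR(3) NOT NULL,
    text TEXT NOT NULL,
    PRIMARY KEY (question_id, language)
);
CREATE TABLE answers (
    id INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL REFERENCES questions(id),
    language CHAR(3) NOT NULL,
    text TEXT NOT NULL
);
CREATE TABLE propositions (
    id INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL REFERENCES questions(id),
    language CHAR(3) NOT NULL,
    text TEXT NOT NULL
);
"""


def load_question(conn, question_id, language):
    """One question in one language, or None if it has no text in that language."""
    row = conn.execute(
        "SELECT text FROM question_texts WHERE question_id = ? AND language = ?",
        (question_id, language),
    ).fetchone()
    if row is None:
        return None

    def texts(table):
        # `table` is one of two fixed names below, never user input.
        rows = conn.execute(
            f"SELECT text FROM {table} "
            "WHERE question_id = ? AND language = ? ORDER BY id",
            (question_id, language),
        )
        return [r[0] for r in rows]

    return {
        "question": row[0],
        "correct": texts("answers"),
        "propositions": texts("propositions"),
    }


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO questions (id, category_id) VALUES (1, 'Nature')")
conn.executemany(
    "INSERT INTO question_texts (question_id, language, text) VALUES (?, ?, ?)",
    [(1, "en", "Select all the animals"), (1, "fr", "Sélectionnez tous les animaux.")],
)
conn.executemany(
    "INSERT INTO answers (question_id, language, text) VALUES (?, ?, ?)",
    [(1, "en", "Cat"), (1, "en", "Cow"), (1, "fr", "Chat"), (1, "fr", "Vache")],
)
conn.executemany(
    "INSERT INTO propositions (question_id, language, text) VALUES (?, ?, ?)",
    [(1, "en", "Car"), (1, "fr", "Auto")],
)

print(load_question(conn, 1, "fr"))
```

The same question id ties the languages together, which is the connection I was worried about.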

mtancoigne commented 7 years ago

That's interesting; setting the language in the answers and propositions makes sense. For the additional question table, I don't know if it's really relevant to have a question translated. My first thought was to have questions in many languages but with no relation between them, as totally different sets. Your proposition makes sense, but when I think about specific types of questions, riddles for example, they tend to translate badly...

[image: selection_102]

mtancoigne commented 7 years ago

And by the way, let me introduce @SundayPlayer, a friend who might be interested in this project.

el-cms / Open-trivia-database
