ProjetPP / PPP-QuestionParsing-Grammatical

Question Parsing module for the PPP using a grammatical approach
GNU Affero General Public License v3.0

Build pickle objects on install #132

Closed progval closed 9 years ago

progval commented 9 years ago

As you may know, I'm really not a big fan of binary data in git repositories.

If I am right, you use pickle objects to make the module load faster. What if you had a Python file in the source code, which was used to produce a pickle file when installing the module?

If this is a feature you want, I can do it.

Ezibenroc commented 9 years ago

If I am right, you use pickle objects to make the module load faster.

No. We just use pickle because this takes less memory, but I am not sure this is a good reason. An alternative would be to store all this in plain text.

Keep in mind that these are not small maps. The biggest one has 6951 keys, and its values are lists of strings (146602 in total).

progval commented 9 years ago

Wait… it takes less memory than what?

Ezibenroc commented 9 years ago

Than storing it in plain text.

We can generate the big map (nounificationAuto.pickle) automatically, but this would add additional dependencies to the module (conceptnet5) and greatly increase the installation time (several hours).

The small map (nounificationManual.pickle) is hand made.

progval commented 9 years ago

I don't understand how the way of storing it changes the memory usage after loading it… (and destroying the objects used while loading).

Maybe we could build only the small map on install?

Ezibenroc commented 9 years ago

I don't understand how the way of storing it changes the memory usage after loading it… (and destroying the objects used while loading).

I was not speaking about RAM, but disk storage. But I don't think it is relevant.

Maybe we could build only the small map on install?

You cannot build it automatically, the map has to be stored somehow (we added manually each word that this map contains).

progval commented 9 years ago

You cannot build it automatically, the map has to be stored somehow (we added manually each word that this map contains).

I meant storing it in plain text and building the map from it

Ezibenroc commented 9 years ago

We could store it in JSON format, and use json.load and json.dump instead of pickle.load and pickle.dump. Would it be ok?

Should we do the same for the big map? Or should we keep it in pickle?
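For what it's worth, the switch is nearly a drop-in change for data of this shape (a dict mapping strings to lists of strings). A minimal sketch, with hypothetical file names:

```python
import json
import pickle

# Example map with the same shape as the nounification data:
# string keys mapping to lists of strings.
verb_to_nouns = {"elect": ["election", "elector"], "write": ["writer", "writing"]}

# pickle round-trip (binary file)
with open("map.pickle", "wb") as f:
    pickle.dump(verb_to_nouns, f)
with open("map.pickle", "rb") as f:
    assert pickle.load(f) == verb_to_nouns

# json round-trip (text file; indent and sorted keys make diffs readable)
with open("map.json", "w") as f:
    json.dump(verb_to_nouns, f, indent=4, sort_keys=True)
with open("map.json", "r") as f:
    assert json.load(f) == verb_to_nouns
```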

progval commented 9 years ago

ok for json for the small map. (one line per entry, so diffs will be easier to read)

What would be the size of the big map if you used json?

yhamoudi commented 9 years ago

If I am right, you use pickle objects to make the module load faster.

No. We just use pickle because this takes less memory, but I am not sure this is a good reason. An alternative would be to store all this in plain text.

As far as I remember, it was to load the file faster.

The biggest one has 6951 keys, and its values are lists of strings (146602 in total).

Now, we have 2 big maps: one for nounification, and another that associates each verb with its past participle (8462 keys). These 2 maps are seldom changed.

The 3rd one is manualNounification. It is actually small, but it changes very often. It would be very useful for us if we can store it in a readable way, for instance (in a *.txt file):

elect -> election
elect <- elector, voter, election
live in <- inhabitant, resident
...

Then the file is parsed and the map is automatically built at installation. Is it easy to do?

Moreover, we should be able to test our code without re-installing the module every time.
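Parsing such a format at install time would be straightforward. A sketch, assuming `->` feeds the direct map and `<-` the inverse map (the exact semantics of the arrows are an assumption here):

```python
def parse_nounification(lines):
    """Parse the proposed text format into two maps:
    'verb -> nouns' entries go to the direct map,
    'verb <- nouns' entries go to the inverse map."""
    direct, inverse = {}, {}
    for line in lines:
        line = line.strip()
        if "->" in line:
            sep, target = "->", direct
        elif "<-" in line:
            sep, target = "<-", inverse
        else:
            continue  # skip blank or malformed lines in this sketch
        verb, _, nouns = line.partition(sep)
        target.setdefault(verb.strip(), []).extend(
            n.strip() for n in nouns.split(","))
    return direct, inverse

direct, inverse = parse_nounification([
    "elect -> election",
    "elect <- elector, voter, election",
    "live in <- inhabitant, resident",
])
# direct == {'elect': ['election']}
# inverse == {'elect': ['elector', 'voter', 'election'],
#             'live in': ['inhabitant', 'resident']}
```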

Ezibenroc commented 9 years ago

What would be the size of the big map if you used json?

3.7 MB

progval commented 9 years ago

Just 1MB more than pickle, then. If you are sure you will never edit nounificationAuto.pickle, then it's useless to use JSON instead (it would add 3.7MB to the Git history); but otherwise I think it's better to switch as soon as possible. Or we could agree that you will switch it to JSON when you have to edit it.

Ezibenroc commented 9 years ago

It would be very useful for us if we can store it in a readable way

This is why Nounificator has a method __str__.

If you want to avoid typing some commands each time you want to see the map, add this to your ~/.bash_aliases file:

alias printmap="python3 -c \"from nounDB import * ; n=Nounificator() ; n.load('/path/to/your/folder/PPP-QuestionParsing-Grammatical/ppp_questionparsing_grammatical/data/nounificationManual.json') ; print(n)\""

Then, just type printmap.

yhamoudi commented 9 years ago

This is why Nounificator has a method __str__.

It's not to see it but to update it. If you want to add an entry into the *.json file, you need to indent your text, add quotation marks, add brackets, create a new line for each word, ...

Ezibenroc commented 9 years ago

Just 1MB more than pickle, then. If you are sure you will never edit nounificationAuto.pickle, then it's useless to use JSON instead (it would add 3.7MB to the Git history); but otherwise I think it's better to switch as soon as possible. Or we could agree that you will switch it to JSON when you have to edit it.

#133 modifies it.

It was dumped as:

pickle.dump(a, f)
pickle.dump(b, f)

Now it will be:

pickle.dump([a, b], f)

The reason is that json, unlike pickle, does not support several successive dumps and loads on the same file.
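A sketch of the difference, using in-memory files to stand in for the real ones: pickle can frame several successive dumps in one stream, while json cannot, hence the wrapping list:

```python
import io
import json
import pickle

a, b = {"elect": ["election"]}, {"election": ["elect"]}

# pickle supports a framed protocol: several dumps into one stream,
# read back with the same number of loads.
buf = io.BytesIO()
pickle.dump(a, buf)
pickle.dump(b, buf)
buf.seek(0)
assert pickle.load(buf) == a
assert pickle.load(buf) == b

# json has no such framing: concatenating two dumps produces text that
# json.load rejects, so both objects are wrapped in a single list instead.
text = json.dumps([a, b])
loaded_a, loaded_b = json.loads(text)
assert loaded_a == a and loaded_b == b
```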

Ezibenroc commented 9 years ago

It's not to see it but to update it. If you want to add an entry into the *.json file, you need to indent your text, add quotation mark, add brackets, create a new line for each word, ...

We have methods for this, please do not edit the file manually. Again, if you just want to avoid writing commands, just do a little script (something like ./addDirect write author).

yhamoudi commented 9 years ago

We changed the format so that git "understands" the updates. Why couldn't we adopt a format that allows us to update the file manually (it will always be easier/quicker than with the command line)?

progval commented 9 years ago

The reason is that json, unlike pickle, does not support several successive dumps and loads on the same file.

(this is called a framed protocol, by the way)

Ezibenroc commented 9 years ago

We changed the format so that git "understands" the updates. Why couldn't we adopt a format that allows us to update the file manually (it will always be easier/quicker than with the command line)?

It is better to use standard things when they exist and are good. You could say the same thing for our datamodel. Why do we use:

{
    "type": "triple",
    "subject": {"type": "resource", "value": "George Washington"},
    "predicate": {"type": "resource", "value": "birth date"},
    "object": {"type": "missing"}
}

Instead of:

(George Washington, birth date, ?)

I think we should keep JSON. If you want, we can do a little script which takes a file written in your way and merges it into the database.

Deletions/edits rarely occur, so it won't be too painful to use the remove method of Nounificator.

yhamoudi commented 9 years ago

It is better to use standard things when they exist and are good.

The "standard thing" (for us) is the map stored in the pickle file at the end, not the *.json (it's another problem if someone wants to have a *.json from the map; we could add an export method to nounDB).

Here we are adding a step to simplify the way we store the data: instead of having the *.pickle directly, we build a *.json that is converted afterwards. Why don't we simplify the procedure as much as we can with a *.txt?

If you want, we can do a little script which takes a file written in your way and merges it into the database.

Yes, but really: why an intermediate representation with json? Why should we have txt -> json -> pickle instead of txt -> pickle? We agree that it is not possible to build a map of thousands of keys (nounificationManual) using functions (add, remove, ...) instead of just filling up a txt file? If the only way we update nounificationManual is by merging a *.txt file with a *.json, we could directly use the txt.

You could say the same thing for our datamodel.

It's not the same. The normal forms are exchanged/manipulated between several modules. Here we directly convert the data (json or txt) into pickle, and then we manipulate it.

Ezibenroc commented 9 years ago

Did you read what is done in #133? We used to have pickle → python object. Now we have pickle → python object AND JSON → python object. There is no JSON → pickle (unless you do n.load('foo.json') and then n.save('foo.pickle')).

It's not the same. The normal forms are exchanged/manipulated between several modules. Here we directly convert the data (json or txt) into pickle, and then we manipulate it.

Other modules could be interested in our maps.

yhamoudi commented 9 years ago

Yes, replace pickle by python map in the previous post :) Our nounificationManual map will be in json format (converted into a python map) instead of pickle. I'm just saying that the main advantage of this is for git and not for us. Since we changed nounDB to store the data in a more readable way, we could directly adopt the easiest way to update the data.

Other modules could be interested in our maps.

We could provide our maps in *.pickle and *.json formats, frequently updated. The *.txt file contains the manually written data, and each week, for instance, we export a *.pickle and a *.json file from it (using the save method of nounDB).

Ezibenroc commented 9 years ago

OK. I'll let you update #133 to handle your file format.

I suggest you write a class with the same behaviour as the modules json and pickle, with dump and load methods. The use case would be something like this for Nounificator.load (similar for save):

    def load(self, fileName):
        """
            Load the database from the file of given name (pickle, json or text format).
        """
        # assumes os, pickle and json are imported at module level
        fileExtension = os.path.splitext(fileName)[1][1:]
        if fileExtension in self.pickleExtension:
            f = open(fileName, 'rb')
            module = pickle
        elif fileExtension in self.jsonExtension:
            f = open(fileName, 'r')
            module = json
        elif fileExtension in self.textExtension:
            f = open(fileName, 'r')
            module = YourSuperClassName
        else:
            raise ValueError('unknown file extension: %s' % fileName)
        [self.verbToNounsDirect, self.verbToNounsInverse] = module.load(f)
        f.close()
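A minimal sketch of what such a class could look like for the `->`/`<-` text format discussed above. The class name, the arrow semantics, and the `[direct, inverse]` return shape are assumptions chosen to match the `module.load(f)` call in the snippet:

```python
import io

class TextFormat:
    """Module-like class exposing load/dump with the same call shape as
    pickle and json, for the 'verb -> nouns' / 'verb <- nouns' text format."""

    @staticmethod
    def load(f):
        direct, inverse = {}, {}
        for line in f:
            line = line.strip()
            if "->" in line:
                sep, target = "->", direct
            elif "<-" in line:
                sep, target = "<-", inverse
            else:
                continue  # skip blank or malformed lines
            verb, _, nouns = line.partition(sep)
            target.setdefault(verb.strip(), []).extend(
                n.strip() for n in nouns.split(","))
        return [direct, inverse]

    @staticmethod
    def dump(maps, f):
        direct, inverse = maps
        for verb, nouns in sorted(direct.items()):
            f.write("%s -> %s\n" % (verb, ", ".join(nouns)))
        for verb, nouns in sorted(inverse.items()):
            f.write("%s <- %s\n" % (verb, ", ".join(nouns)))

# Example round-trip on an in-memory file:
maps = TextFormat.load(io.StringIO("elect -> election\nelect <- elector, voter\n"))
# maps == [{'elect': ['election']}, {'elect': ['elector', 'voter']}]
```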
yhamoudi commented 9 years ago

When we use a json/txt file (as here: https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/pull/133), won't the files be loaded every time a query is made online?

Ezibenroc commented 9 years ago

When we use a json/txt file (as here: #133), won't the files be loaded every time a query is made online?

No, I think they will be loaded when we start the module. If you want to be sure, add some prints where we load the file...