Closed progval closed 9 years ago
> If I am right, you use pickle objects to make the module load faster.

No. We just use pickle because it takes less memory, but I am not sure this is a good reason. An alternative would be to store all this in plain text.
Keep in mind that these are not small maps. The biggest one has 6951 keys, and its values are lists of strings (146602 in total).
Wait… it takes less memory than what?
Than storing it in plain text.
We can generate the big map (`nounificationAuto.pickle`) automatically, but this will add dependencies to the module (conceptnet5) and greatly increase the installation time (several hours).
The small map (`nounificationManual.pickle`) is hand-made.
I don't understand how the way of storing it changes the memory usage after loading it… (and destroying the objects used while loading).
Maybe we could build only the small map on install?
> I don't understand how the way of storing it changes the memory usage after loading it… (and destroying the objects used while loading).
I was not speaking about RAM, but disk storage. But I don't think it is relevant.
> Maybe we could build only the small map on install?
You cannot build it automatically; the map has to be stored somehow (we manually added each word that this map contains).
> You cannot build it automatically; the map has to be stored somehow (we manually added each word that this map contains).
I meant storing it in plain text and building the map from it.
We could store it in JSON format, and use `json.load` and `json.dump` instead of `pickle.load` and `pickle.dump`. Would that be OK?
Should we do the same for the big map? Or should we keep it in pickle?
OK for JSON for the small map (one line per entry, so diffs will be easier to read).
What would be the size of the big map if you used json?
> If I am right, you use pickle objects to make the module load faster.

> No. We just use pickle because it takes less memory, but I am not sure this is a good reason. An alternative would be to store all this in plain text.
As I recall, it was to load the file faster.
> The biggest one has 6951 keys, and its values are lists of strings (146602 in total).
Now we have two big maps: one for nounification, and another that associates each verb with its past participle (8462 keys). These two maps are seldom changed.
The third one is manualNounification. It's actually small, but it is changed very often. It would be very useful for us if we could store it in a readable way, for instance in a `*.txt` file:
```
elect -> election
elect <- elector, voter, election
live in <- inhabitant, resident
...
```
Then the file is parsed and the map is automatically built at installation. Is it easy to do?
Moreover, we should be able to test our code without re-installing the module every time.
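Parsing the proposed format could look like this (a sketch; the assumption that `->` fills the direct verb-to-nouns map and `<-` the inverse one is mine, not taken from the actual nounDB code):

```python
def parse_noun_map(lines):
    # Build the direct ("verb -> nouns") and inverse ("verb <- nouns")
    # maps from the proposed plain-text format.
    direct, inverse = {}, {}
    for line in lines:
        line = line.strip()
        if not line or line == '...':
            continue
        if '->' in line:
            verb, nouns = line.split('->', 1)
            target = direct
        elif '<-' in line:
            verb, nouns = line.split('<-', 1)
            target = inverse
        else:
            continue  # silently skip malformed lines in this sketch
        target.setdefault(verb.strip(), []).extend(
            noun.strip() for noun in nouns.split(','))
    return direct, inverse
```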
> What would be the size of the big map if you used json?
3.7 MB
Just 1 MB more than in pickle, then. If you are sure you will never edit nounificationAuto.pickle, then it's useless to use JSON instead (it would add 3.7 MB to the Git history); but otherwise I think it's better to switch as soon as possible. Or we could agree that you will switch it to JSON when you have to edit it.
> It would be very useful for us if we could store it in a readable way
This is why Nounificator has a `__str__` method.
If you want to avoid typing these commands each time you want to see the map, add this to your `~/.bash_aliases` file:

```
alias printmap="python3 -c \"from nounDB import * ; n=Nounificator() ; n.load('/path/to/your/folder/PPP-QuestionParsing-Grammatical/ppp_questionparsing_grammatical/data/nounificationManual.json') ; print(n)\""
```

Then, just type `printmap`.
> This is why Nounificator has a `__str__` method.
It's not to see it but to update it. If you want to add an entry to the *.json file, you need to indent your text, add quotation marks, add brackets, create a new line for each word, ...
> Just 1 MB more than in pickle, then. If you are sure you will never edit nounificationAuto.pickle, then it's useless to use JSON instead (it would add 3.7 MB to the Git history); but otherwise I think it's better to switch as soon as possible. Or we could agree that you will switch it to JSON when you have to edit it.
It was dumped as:

```python
pickle.dump(a, f)
pickle.dump(b, f)
```

Now it will be:

```python
pickle.dump([a, b], f)
```
The reason is that json, unlike pickle, does not support several successive dumps and loads on the same file.
> It's not to see it but to update it. If you want to add an entry to the *.json file, you need to indent your text, add quotation marks, add brackets, create a new line for each word, ...
We have methods for this, please do not edit the file manually.
Again, if you just want to avoid writing commands, just write a little script (something like `./addDirect write author`).
We change the format so that git "understands" the updates. Why couldn't we adopt a format that allows us to update the file manually (it will always be easier/quicker than with the command line)?
> The reason is that json, unlike pickle, does not support several successive dumps and loads on the same file.
(this is called a framed protocol, by the way)
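The behaviour is easy to check with a small experiment (illustrative data only):

```python
import io
import json

# Two successive dumps produce concatenated documents ("{...}{...}"),
# which a single json.load cannot parse back...
buf = io.StringIO()
json.dump({'a': 1}, buf)
json.dump({'b': 2}, buf)
buf.seek(0)
try:
    json.load(buf)
    framed_ok = True
except ValueError:  # json.JSONDecodeError subclasses ValueError
    framed_ok = False

# ...whereas a single dump of a list round-trips cleanly.
buf2 = io.StringIO()
json.dump([{'a': 1}, {'b': 2}], buf2)
buf2.seek(0)
a, b = json.load(buf2)
```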
> We change the format so that git "understands" the updates. Why couldn't we adopt a format that allows us to update the file manually (it will always be easier/quicker than with the command line)?
It is better to use standard things when they exist and are good. You could say the same thing about our data model. Why do we use:

```json
{
    "type": "triple",
    "subject": {"type": "resource", "value": "George Washington"},
    "predicate": {"type": "resource", "value": "birth date"},
    "object": {"type": "missing"}
}
```

Instead of:

```
(George Washington, birth date, ?)
```
I think we should keep JSON. If you want, we can do a little script which takes a file written in your way and merges it into the database.
Deletions/edits rarely occur, so it won't be too painful to use the `remove` method of Nounificator.
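Such a merge script could be a few lines. A sketch operating on plain dicts (the real one would presumably go through Nounificator's add/remove methods rather than touching the maps directly):

```python
def merge_entries(base, additions):
    # Merge new "verb -> nouns" entries into the existing map,
    # skipping nouns that are already present for a verb.
    for verb, nouns in additions.items():
        existing = base.setdefault(verb, [])
        for noun in nouns:
            if noun not in existing:
                existing.append(noun)
    return base
```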
> It is better to use standard things when they exist and are good.
The "standard thing" (for us) is the map stored in the pickle file in the end, not the `*.json` (it's a different problem if someone wants a `*.json` of the map; we could add an `export` method to nounDB).
Here we are adding a step to simplify the way we store the data: instead of directly having the `*.pickle`, we build a `*.json` that is converted afterwards. Why don't we simplify the procedure as much as we can, with a `*.txt`?
> If you want, we can do a little script which takes a file written in your way and merges it into the database.
Yes, but really: why an intermediate representation with JSON? Why should we have `txt -> json -> pickle` instead of `txt -> pickle`? Do we agree that it is not possible to build a map of thousands of keys (nounificationManual) using functions (add, remove, ...) instead of just filling up a txt file? If the only way we update nounificationManual is by merging a `*.txt` file into a `*.json`, we could directly use the txt.
> You could say the same thing about our data model.
It's not the same. The normal forms are exchanged/manipulated between several modules. Here we directly convert the data (json or txt) into pickle, and then we manipulate it.
Did you read what is done in #133?
We used to have `pickle → python object`. Now we have `pickle → python object` AND `JSON → python object`. There is no `JSON → pickle` (unless you do `n.load('foo.json')` and then `n.save('foo.pickle')`).
> It's not the same. The normal forms are exchanged/manipulated between several modules. Here we directly convert the data (json or txt) into pickle, and then we manipulate it.
Other modules could be interested in our maps.
Yes, replace pickle with "python map" in the previous post :) Our nounificationManual map will be in JSON format (converted into a python map) instead of pickle. I'm just saying that the main advantage of this is for git and not for us. Since we are changing nounDB to store the data in a more readable way, we could directly adopt the easiest way to update the data.
> Other modules could be interested in our maps.
We could provide our maps in `*.pickle` and `*.json` formats, frequently updated. The `*.txt` file contains the manually written data, and each week, for instance, we export a `*.pickle` and a `*.json` file from it (using the save method of nounDB).
Ok. I let you update #133 to handle your file format.
I suggest you write a class with the same behaviour as the json and pickle modules, with `dump` and `load` methods. The use case would be something like this for `Nounificator.load` (similar for `save`):
```python
def load(self, fileName):
    """
    Load the database from the file of given name (pickle, json or text format).
    """
    fileExtension = os.path.splitext(fileName)[1][1:]
    if fileExtension in self.pickleExtension:
        f = open(fileName, 'rb')
        module = pickle
    elif fileExtension in self.jsonExtension:
        f = open(fileName, 'r')
        module = json
    elif fileExtension in self.textExtension:
        f = open(fileName, 'r')
        module = YourSuperClassName
    else:
        raise ValueError('Unsupported file extension: %s' % fileExtension)
    [self.verbToNounsDirect, self.verbToNounsInverse] = module.load(f)
    f.close()
```
When we use a json/txt file (as here: https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/pull/133), won't the files be loaded every time a query is made online?
> When we use a json/txt file (as here: #133), won't the files be loaded every time a query is made online?
No, I think they will be loaded when we start the module. If you want to be sure, add a print statement where we load the file...
As you may know, I'm really not a big fan of binary data in git repositories.
If I am right, you use pickle objects to make the module load faster. What if you had a Python file in the source code, which was used to produce a pickle file when installing the module?
If this is a feature you want, I can do it.