dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

Can not ensure order of items in /df service #11

Closed earthquakesan closed 7 years ago

earthquakesan commented 7 years ago

If one of the words submitted to the /df service has no documents, the item is skipped in the result. The response is then parsed incorrectly, because the df stream does not contain a NULL int (i.e. four empty bytes) for that word. For example:

(Pdb) palmetto.get_df_for_words(['glotto', 'isoexception'])
[('glotto', {1707408})]
(Pdb) palmetto.get_df_for_words(['isoexception', 'glotto'])
[('isoexception', {1707408})]
(Pdb) palmetto.get_df_for_words(['glotto'])
[('glotto', {1707408})]
(Pdb) palmetto.get_df_for_words(['isoexception'])
[]

Here, "glotto" has one document id, "1707408", and "isoexception" has none. However, due to the absence of a NULL int, the byte stream is parsed in such a way that the first item is assigned the received document id and the second item is simply ignored (because the end of the stream has been reached).
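The symptom above is what positional pairing produces whenever empty records are missing from the parsed data. The following is a minimal sketch of that effect only; `pair_dropping_empty` is a hypothetical helper, not code from the client library or from Palmetto, and it assumes the records arrive in request order.

```python
def pair_dropping_empty(words, records):
    """Pair words with records by position after empty records were lost.

    Hypothetical illustration: `records` holds one set of document ids per
    requested word, in request order. Dropping the empty sets shifts every
    later record onto an earlier word, and zip() silently trims the tail.
    """
    non_empty = [record for record in records if record]
    return list(zip(words, non_empty))


# "glotto" has one document id, "isoexception" has none.
print(pair_dropping_empty(["glotto", "isoexception"], [{1707408}, set()]))
print(pair_dropping_empty(["isoexception", "glotto"], [set(), {1707408}]))
```

In both calls the single document id ends up attached to whichever word was listed first, which matches the pdb session above.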

earthquakesan commented 7 years ago

In a more complex example, this leads to completely unexpected results. For instance, "familycolor" and "isoexception" have no document ids associated with them. However, the results show the following words as having document ids (the last two items are effectively trimmed):

dict_keys(['type', 'national', 'subject', 'familycolor', 'fam', 'discipline', 'character', 'label', 'topic'])

The full example is here:

(Pdb) words_does_not_work
['label', 'type', 'character', 'subject', 'discipline', 'topic', 'national', 'familycolor', 'fam', 'glotto', 'isoexception']
(Pdb) len(palmetto.get_df_for_words(['label'])[0][1])
116680
(Pdb) len(palmetto.get_df_for_words(['type'])[0][1])
210056
(Pdb) len(palmetto.get_df_for_words(['character'])[0][1])
223503
(Pdb) len(palmetto.get_df_for_words(['subject'])[0][1])
160247
(Pdb) len(palmetto.get_df_for_words(['discipline'])[0][1])
38882
(Pdb) len(palmetto.get_df_for_words(['topic'])[0][1])
59810
(Pdb) len(palmetto.get_df_for_words(['national'])[0][1])
749384
(Pdb) len(palmetto.get_df_for_words(['familycolor'])[0][1])
*** IndexError: list index out of range
(Pdb) len(palmetto.get_df_for_words(['fam'])[0][1])
922
(Pdb) len(palmetto.get_df_for_words(['glotto'])[0][1])
1
(Pdb) len(palmetto.get_df_for_words(['isoexception'])[0][1])
*** IndexError: list index out of range
(Pdb) dict(doc_ids).keys()
dict_keys(['type', 'national', 'subject', 'familycolor', 'fam', 'discipline', 'character', 'label', 'topic'])

MichaelRoeder commented 7 years ago

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.net.HttpURLConnection;
import java.net.URL;

// The class name is arbitrary; it only makes the snippet compilable.
public class DfClient {
    public static void main(String[] args) throws Exception {
        String url = "http://palmetto.aksw.org/palmetto-webapp/service/df?words=label%20type%20character%20subject%20discipline%20topic%20national%20familycolor%20fam%20glotto%20isoexception";
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();

        con.setRequestMethod("GET");
        System.out.println("Response Code : " + con.getResponseCode());

        // DataInputStream reads big-endian ints, like ByteBuffer's default order
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(con.getInputStream()))) {
            byte[] bytes = new byte[4];
            // Go through all words
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException e) {
                    break; // end of the data stream
                }
                // print the length
                System.out.println(length);
                // skip the data (readFully always consumes all 4 bytes)
                for (int i = 0; i < length; ++i) {
                    in.readFully(bytes);
                }
            }
        }
    }
}

This simple GET-based client gives me the following results:

116680
210056
223503
160247
38882
59810
749384
0
922
1
0

Please note that the length 0 is not followed by data, i.e., it is followed by the length of the next word or the end of the data stream.
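The format described above can be parsed on the Python side roughly as follows. This is a minimal sketch, not the Palmetto client API: `parse_df_stream` is a hypothetical name, the big-endian byte order follows from the default of Java's `ByteBuffer` used in the client above, and treating each 4-byte item as an integer document id is an assumption. It also assumes the whole response body is already buffered, so that `read(n)` returns fewer than `n` bytes only at the end of the data.

```python
import struct
from io import BytesIO


def parse_df_stream(words, stream):
    """Pair each requested word with its set of document ids.

    Assumed layout per word: a 4-byte big-endian length, followed by that
    many 4-byte big-endian document ids. A length of 0 is followed directly
    by the next word's length (or by the end of the stream).
    """
    result = []
    for word in words:
        header = stream.read(4)
        if len(header) < 4:
            raise ValueError(f"stream ended before the entry for {word!r}")
        (length,) = struct.unpack(">i", header)
        payload = stream.read(4 * length)
        if len(payload) < 4 * length:
            raise ValueError(f"truncated document ids for {word!r}")
        # For length 0 this unpacks an empty payload into an empty tuple.
        doc_ids = set(struct.unpack(f">{length}i", payload))
        result.append((word, doc_ids))
    return result


# Hand-built stream: one id for "glotto", zero ids for "isoexception".
data = struct.pack(">ii", 1, 1707408) + struct.pack(">i", 0)
print(parse_df_stream(["glotto", "isoexception"], BytesIO(data)))
```

Because the zero length is consumed as its own entry, a word without documents keeps its position and is returned with an empty set instead of shifting the following entries.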

MichaelRoeder commented 7 years ago

Note that the same holds for words with an underscore. Maybe switching to GET for calls to the df service solves this issue as well as issue #10.

Another possibility is that the POST requests that you are creating, and that I am using for testing, are malformed.

earthquakesan commented 7 years ago

It was a bug in my library, thanks for the pointers!