Can not ensure order of items in /df service

earthquakesan commented 7 years ago

If one of the words submitted to the /df service has no documents, the item appears to be skipped in the result. The response is then parsed incorrectly, because the df output does not contain a NULL int (i.e. four empty bytes) for that word. For example:

(Pdb) palmetto.get_df_for_words(['glotto', 'isoexception'])
[('glotto', {1707408})]
(Pdb) palmetto.get_df_for_words(['isoexception', 'glotto'])
[('isoexception', {1707408})]
(Pdb) palmetto.get_df_for_words(['glotto'])
[('glotto', {1707408})]
(Pdb) palmetto.get_df_for_words(['isoexception'])
[]

Here, "glotto" has a single document ID, 1707408, and "isoexception" has none. However, due to the absence of a NULL int, the byte stream is parsed so that the first item is assigned the received document ID and the second item is simply dropped (because the stream has already ended).

earthquakesan commented 7 years ago

In a more complex example, this leads to completely unexpected results. For instance, "familycolor" and "isoexception" have no associated document IDs. However, the results show the following words as having document IDs (the last two items are effectively trimmed):

dict_keys(['type', 'national', 'subject', 'familycolor', 'fam', 'discipline', 'character', 'label', 'topic'])

The full example here:

(Pdb) words_does_not_work
['label', 'type', 'character', 'subject', 'discipline', 'topic', 'national', 'familycolor', 'fam', 'glotto', 'isoexception']
(Pdb) len(palmetto.get_df_for_words(['label'])[0][1])
116680
(Pdb) len(palmetto.get_df_for_words(['type'])[0][1])
210056
(Pdb) len(palmetto.get_df_for_words(['character'])[0][1])
223503
(Pdb) len(palmetto.get_df_for_words(['subject'])[0][1])
160247
(Pdb) len(palmetto.get_df_for_words(['discipline'])[0][1])
38882
(Pdb) len(palmetto.get_df_for_words(['topic'])[0][1])
59810
(Pdb) len(palmetto.get_df_for_words(['national'])[0][1])
749384
(Pdb) len(palmetto.get_df_for_words(['familycolor'])[0][1])
*** IndexError: list index out of range
(Pdb) len(palmetto.get_df_for_words(['fam'])[0][1])
922
(Pdb) len(palmetto.get_df_for_words(['glotto'])[0][1])
1
(Pdb) len(palmetto.get_df_for_words(['isoexception'])[0][1])
*** IndexError: list index out of range
(Pdb) dict(doc_ids).keys()
dict_keys(['type', 'national', 'subject', 'familycolor', 'fam', 'discipline', 'character', 'label', 'topic'])
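The nine keys above are exactly the first nine of the eleven submitted words. One failure mode that reproduces this pattern is sketched below: the empty entries are dropped before the per-word results are paired with the words by position. This is only an illustration, not the library's actual code; the helper names are hypothetical, and the counts are the lengths printed in the session above, with 0 standing in for the two words that returned nothing.

```python
# Illustration only: NOT the library's code, just one failure mode that
# matches the keys shown in the session above.
words = ['label', 'type', 'character', 'subject', 'discipline', 'topic',
         'national', 'familycolor', 'fam', 'glotto', 'isoexception']

# Per-word document counts from the session (0 = no documents returned).
counts = [116680, 210056, 223503, 160247, 38882, 59810, 749384, 0, 922, 1, 0]


def pair_dropping_empty(words, counts):
    """Hypothetical buggy pairing: empty entries are removed first, so every
    later count slides onto an earlier word."""
    non_empty = [c for c in counts if c > 0]
    return list(zip(words, non_empty))


def pair_keeping_empty(words, counts):
    """Positional pairing that keeps the empty entries in place."""
    return list(zip(words, counts))


broken = dict(pair_dropping_empty(words, counts))
correct = dict(pair_keeping_empty(words, counts))

print(sorted(broken))          # nine keys; 'glotto' and 'isoexception' are gone
print(broken['familycolor'])   # count that actually belongs to 'fam'
print(correct['familycolor'])  # 0, as expected
```

With this pairing, "familycolor" inherits the count of "fam" and "fam" inherits the count of "glotto", which is why the last two words disappear.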

MichaelRoeder commented 7 years ago

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.net.HttpURLConnection;
import java.net.URL;

public class DfClient {
    public static void main(String[] args) throws Exception {
        String url = "http://palmetto.aksw.org/palmetto-webapp/service/df?words=label%20type%20character%20subject%20discipline%20topic%20national%20familycolor%20fam%20glotto%20isoexception";
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();

        con.setRequestMethod("GET");
        System.out.println("Response Code : " + con.getResponseCode());

        // DataInputStream.readInt() blocks until four bytes have arrived and reads
        // them big-endian, like ByteBuffer.getInt(); available() can return 0
        // before the stream has actually ended.
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(con.getInputStream()))) {
            // Go through all words
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException e) {
                    break; // end of the data stream
                }
                // print the length
                System.out.println(length);
                // skip the data (one 4-byte int per document id)
                for (int i = 0; i < length; ++i) {
                    in.readInt();
                }
            }
        }
    }
}

This simple GET based client gives me the results:

Please note that a length of 0 is not followed by any data, i.e., the next four bytes are either the length for the next word or the end of the data stream.
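The format described here can be parsed on the Python side roughly as follows. This is a minimal sketch, assuming the layout implied by the Java client above: for each requested word, in request order, a 4-byte big-endian signed length followed by that many 4-byte document IDs. The function name `parse_df_stream` is hypothetical, and the byte stream is built by hand rather than fetched from the service.

```python
import struct


def parse_df_stream(data, words):
    """Parse a /df-style byte stream into (word, set_of_doc_ids) pairs.

    Assumed layout, per word in request order: a 4-byte big-endian length,
    then that many 4-byte big-endian document ids. A length of 0 is followed
    directly by the next word's length (or by the end of the stream).
    """
    result = []
    offset = 0
    for word in words:
        (length,) = struct.unpack_from(">i", data, offset)
        offset += 4
        # Read the ids only when there are some; a zero length has no payload.
        ids = struct.unpack_from(">%di" % length, data, offset) if length else ()
        offset += 4 * length
        # Keep empty entries so that words and results stay aligned.
        result.append((word, set(ids)))
    return result


# Hand-built stream: 'isoexception' has no documents, 'glotto' has one.
stream = struct.pack(">i", 0) + struct.pack(">ii", 1, 1707408)
parsed = parse_df_stream(stream, ["isoexception", "glotto"])
print(parsed)
```

Because the zero length is consumed like any other length, the document ID ends up attached to "glotto" regardless of the order in which the two words are requested.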

MichaelRoeder commented 7 years ago

Note that the same holds for words with an underscore. Switching to GET for calls to the df service might solve this issue as well as issue #10.

Another possibility is that the POST requests that you are creating and that I am using for testing are malformed.
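For reference, a GET URL of the same shape as the one in the Java client above could be assembled in Python like this. It is a sketch only: the endpoint is copied from that client, the helper name `build_df_url` is hypothetical, and joining the words with URL-encoded spaces is inferred from the example URL rather than from any service documentation.

```python
from urllib.parse import quote

# Endpoint taken from the Java client earlier in the thread.
DF_ENDPOINT = "http://palmetto.aksw.org/palmetto-webapp/service/df"


def build_df_url(words, endpoint=DF_ENDPOINT):
    """Build a GET URL with the words joined by URL-encoded spaces (%20)."""
    # safe="" makes quote() escape every reserved character; underscores are
    # unreserved and pass through unchanged.
    return endpoint + "?words=" + quote(" ".join(words), safe="")


print(build_df_url(["glotto", "isoexception"]))
print(build_df_url(["some_word"]))
```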

earthquakesan commented 7 years ago

It was a bug in my library, thanks for the pointers!

dice-group / Palmetto

Can not ensure order of items in /df service #11