earthquakesan closed this issue 7 years ago
In a more complex example, this leads to completely unexpected results. For instance, "familycolor" and "isoexception" have no document ids associated with them. However, the parsed result assigns document ids to the following words (it effectively drops the last two items, "glotto" and "isoexception"):
dict_keys(['type', 'national', 'subject', 'familycolor', 'fam', 'discipline', 'character', 'label', 'topic'])
The full example is here:
(Pdb) words_does_not_work
['label', 'type', 'character', 'subject', 'discipline', 'topic', 'national', 'familycolor', 'fam', 'glotto', 'isoexception']
(Pdb) len(palmetto.get_df_for_words(['label'])[0][1])
116680
(Pdb) len(palmetto.get_df_for_words(['type'])[0][1])
210056
(Pdb) len(palmetto.get_df_for_words(['character'])[0][1])
223503
(Pdb) len(palmetto.get_df_for_words(['subject'])[0][1])
160247
(Pdb) len(palmetto.get_df_for_words(['discipline'])[0][1])
38882
(Pdb) len(palmetto.get_df_for_words(['topic'])[0][1])
59810
(Pdb) len(palmetto.get_df_for_words(['national'])[0][1])
749384
(Pdb) len(palmetto.get_df_for_words(['familycolor'])[0][1])
*** IndexError: list index out of range
(Pdb) len(palmetto.get_df_for_words(['fam'])[0][1])
922
(Pdb) len(palmetto.get_df_for_words(['glotto'])[0][1])
1
(Pdb) len(palmetto.get_df_for_words(['isoexception'])[0][1])
*** IndexError: list index out of range
(Pdb) dict(doc_ids).keys()
dict_keys(['type', 'national', 'subject', 'familycolor', 'fam', 'discipline', 'character', 'label', 'topic'])
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.net.HttpURLConnection;
import java.net.URL;

// The class name is arbitrary; it only makes the snippet compilable.
public class DfGetClient {
    public static void main(String[] args) throws Exception {
        String url = "http://palmetto.aksw.org/palmetto-webapp/service/df?words=label%20type%20character%20subject%20discipline%20topic%20national%20familycolor%20fam%20glotto%20isoexception";
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("GET");
        System.out.println("Response Code : " + con.getResponseCode());
        // readInt() blocks until all 4 bytes have arrived, unlike a bare read(byte[])
        try (DataInputStream is = new DataInputStream(new BufferedInputStream(con.getInputStream()))) {
            // Go through all words until the stream ends
            while (true) {
                int length;
                try {
                    length = is.readInt(); // 4-byte big-endian length
                } catch (EOFException e) {
                    break; // no more words
                }
                // print the length
                System.out.println(length);
                // skip the data (one 4-byte document id per entry)
                for (int i = 0; i < length; ++i) {
                    is.readInt();
                }
            }
        }
    }
}
This simple GET-based client gives me the following results:
116680
210056
223503
160247
38882
59810
749384
0
922
1
0
Please note that a length of 0 is not followed by data, i.e., it is followed directly by the length of the next word or by the end of the data stream.
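For the Python side, a minimal sketch of a parser for this layout could look as follows. It assumes what the Java client above implies: each word contributes a 4-byte big-endian length followed by that many 4-byte document ids, in the order the words were requested. The name `parse_df` is hypothetical and is not part of any Palmetto client library.

```python
import struct


def parse_df(payload, words):
    """Pair each requested word with its document ids.

    Assumed layout per word: int32 length, then `length` int32 ids,
    all big-endian (the default byte order of Java's ByteBuffer).
    """
    result = {}
    offset = 0
    for word in words:
        # Read the length prefix; a length of 0 means "no documents".
        (length,) = struct.unpack(">i", payload[offset:offset + 4])
        offset += 4
        # Read exactly `length` ids; for length 0 this consumes nothing.
        chunk = payload[offset:offset + 4 * length]
        result[word] = list(struct.unpack(">%di" % length, chunk))
        offset += 4 * length
    if offset != len(payload):
        raise ValueError("unexpected trailing bytes after the last word")
    return result


# Hand-built payload: "glotto" has one id, "isoexception" has none.
payload = struct.pack(">iii", 1, 1707408, 0)
parsed = parse_df(payload, ["glotto", "isoexception"])
```

Because the loop is driven by the list of requested words rather than by whatever data happens to arrive, a word without documents still ends up in the result with an empty list.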
Note that the same holds for words with an underscore. Maybe switching to GET for calls of the df service solves this issue as well as issue #10.
Another possibility is that the POST requests you are creating and I am using for testing are malformed.
It was a bug in my library, thanks for the pointers!
If one of the words submitted to the /df service does not occur in any document, the item is skipped in the response. This results in the response being parsed incorrectly, as the df output does not contain a NULL int (i.e. four zero bytes) for that word. For example:
Here, "glotto" has one document id, "1707408", and "isoexception" has none. However, due to the absence of a NULL int, the bytestream is parsed in such a way that the first item is assigned the received document id and the second item is simply ignored (because the end of the stream has been reached).
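To make the suspected ambiguity concrete, here is a small sketch of the encoding side. It is purely illustrative: `encode_df` and its `keep_empty` flag are hypothetical names, not Palmetto code, and the layout (4-byte big-endian length followed by 4-byte ids) is an assumption.

```python
import struct


def encode_df(doc_ids_per_word, keep_empty=True):
    """Serialize one id list per word as int32 length + int32 ids (big-endian)."""
    out = bytearray()
    for ids in doc_ids_per_word:
        if not ids and not keep_empty:
            # Hypothetical behaviour described in this issue:
            # words without documents are dropped entirely.
            continue
        out += struct.pack(">i", len(ids))
        out += struct.pack(">%di" % len(ids), *ids)
    return bytes(out)


# "glotto" has one document id, "isoexception" has none.
with_zero = encode_df([[1707408], []])
without_zero = encode_df([[1707408], []], keep_empty=False)
single_word = encode_df([[1707408]])
```

With the zero length kept, the stream ends in four zero bytes and each word can be matched to its entry. If the empty entry were dropped, the two-word response would be byte-for-byte identical to a response for a single word, so a parser would have no way to tell which word the ids belong to.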