LBeaudoux opened this issue 3 years ago
In the more recent 14-Sep-2021 01:12 version of `queries.csv.bz2`, I only found one faulty row at line 10665247:

    [Mon Nov 4 Nov 2019,eng,contradictory

However, the `UnicodeDecodeError` issue remains. Interestingly, when I open `queries.csv` with VS Code, I get the following modal:

[screenshot of the VS Code modal]

After clicking on "Remove Unusual Line Terminators", I can read the file without raising exceptions.

I also noticed that since the search refactoring of last year, queries are not systematically duplicated in the log. However, some queries still appear several times in a row or separated by a few lines. I suppose that a new entry is recorded when a user changes page. Is it possible to track the pages visited during a single search in `query.log`? If so, having the source file would help me a lot to eliminate duplicates.
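For reference, here is a minimal Python sketch of what that VS Code command does, assuming the offending characters are the Unicode line separator (U+2028) and paragraph separator (U+2029); the file names are placeholders, and `errors="replace"` only keeps undecodable bytes from aborting the pass.

```python
# Strip the "unusual line terminators" (LS U+2028, PS U+2029) that VS Code
# complains about, writing the cleaned copy to a new file.
with open("queries.csv", encoding="utf-8", errors="replace") as src, \
     open("queries_clean.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.replace("\u2028", "").replace("\u2029", ""))
```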
jiru replied:

> In the more recent 14-Sep-2021 01:12 version of `queries.csv.bz2`, I only found one faulty row at line 10665247: `[Mon Nov 4 Nov 2019,eng,contradictory`
I looked into this. The original file has the same error, so I am not sure what to do about it. My guess is that the search daemon was stopped in the middle of a log write, or we ran out of disk space, or something like that.
> However, the `UnicodeDecodeError` issue remains.
It is caused by clients sending invalid queries. For the sake of transparency, and because it’s easier, I think it’s better to leave them in the log. These queries shouldn’t return any result anyway, so you can safely ignore them by catching the exception.
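One rough way to follow that suggestion (not an official recipe) is to read the dump in binary and decode line by line, so the few undecodable entries can be skipped; the file name below is just an example path.

```python
# Decode each line separately and skip the entries that are not valid UTF-8,
# i.e. the invalid queries mentioned above.
rows = []
with open("queries.csv", "rb") as f:
    for raw in f:
        try:
            rows.append(raw.decode("utf-8").rstrip("\r\n"))
        except UnicodeDecodeError:
            continue  # broken client encoding; safe to ignore
```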
> I also noticed that since the search refactoring of last year, queries are not systematically duplicated in the log.
Can you give an example please?
> However, some queries still appear several times in a row or separated by a few lines. I suppose that a new entry is recorded when a user changes page.
Your assumption is wrong. It is duplicated because the search engine is queried twice; once for the total number of results and once for the list of results. To my knowledge, this happens on any page.
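If that double query is indeed the only source of duplication, a minimal sketch under that assumption is to collapse runs of identical consecutive rows; rows repeated further apart are left untouched.

```python
from itertools import groupby

def drop_consecutive_duplicates(lines):
    # groupby() groups identical consecutive items, so yielding only the key
    # keeps one copy per run of duplicates.
    for line, _run in groupby(lines):
        yield line

with open("queries.csv", encoding="utf-8", errors="replace") as f:
    deduplicated = list(drop_consecutive_duplicates(f))
```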
> Until we have a better parser, would it be possible to share the source `query.log` file?
No, sorry.
@jiru, thanks for your explanation.
> Can you give an example please?
Until May 17, 2020, each query appears 2 times, mostly consecutively:

    17 May 2020,epo,fajro|fajron|fajroj|fajrojn
    17 May 2020,epo,fajro|fajron|fajroj|fajrojn
    17 May 2020,jpn,v
    17 May 2020,jpn,v
    17 May 2020,jpn,自然
    17 May 2020,jpn,自然
But from line 19,234,247 of `queries.csv` (2021-09-14 version), the consecutive duplicates disappear most of the time:

    17 May 2020,eng,=oh
    17 May 2020,eng,=night
    17 May 2020,eng,=right
    17 May 2020,eng,=three
    17 May 2020,eng,=against
    17 May 2020,eng,=oh
I counted the daily occurrences of the queries after this changeover date with the help of the following statement:
    WITH queries_daily_occurrences AS (
        SELECT
            date,
            language,
            content,
            count(*) AS daily_occurrences
        FROM queries
        WHERE date > "2020-05-17"
        GROUP BY date, language, content
    )
    SELECT daily_occurrences, count(*) AS nb_queries
    FROM queries_daily_occurrences
    GROUP BY daily_occurrences
    ORDER BY daily_occurrences
    LIMIT 6
As browsing the file suggested, this statement returns a vast majority of unduplicated queries:

    daily_occurrences   nb_queries
    1                    3,458,284
    2                      533,374
    3                      111,334
    4                       43,298
    5                       19,599
    6                       11,143
The counts are very different for the queries prior to the switchover date:

    daily_occurrences   nb_queries
    1                           26
    2                    3,481,622
    3                           11
    4                      486,567
    5                            5
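For anyone who wants to reproduce these counts, here is a rough loading sketch. The `queries` table name matches the statement above; the file name and the comma-separated date,language,content layout are taken from the samples, and the date column would still need normalising to YYYY-MM-DD for the `date > "2020-05-17"` comparison to behave as intended.

```python
import csv
import sqlite3

# Load queries.csv into an in-memory SQLite table named "queries" so the
# statement above can be run against it. Rows that do not have exactly
# three fields (e.g. the faulty row discussed earlier) are skipped.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (date TEXT, language TEXT, content TEXT)")

with open("queries.csv", encoding="utf-8", errors="replace", newline="") as f:
    rows = (row for row in csv.reader(f) if len(row) == 3)
    conn.executemany("INSERT INTO queries VALUES (?, ?, ?)", rows)

conn.commit()
```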
For reference, the original report that opened this issue:

> When I try to read the latest `queries.csv` file with Python, multiple errors related to encoding are raised. I did not have this problem with the previous version of the file. In addition, some queries are not processed correctly by the `sed` command, like the one at line 21777:
>
>     [Sun Apr 7 01:16:41.523 2019] 0.002 sec 0.002 sec [ext2/4/ext 0 (0,10)] [fra_main_index,fra_delta_index] fr�quemment
>
> Until we have a better parser, would it be possible to share the source `query.log` file?
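As a side note, a small scan like the one below (file name again a placeholder) can list the line numbers of the entries that are not valid UTF-8, such as the truncated query above.

```python
# Report the line number and decoding error of every entry that is not
# valid UTF-8, to locate rows like the one at line 21777.
with open("query.log", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(f"line {lineno}: {err}")
```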