Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

I cannot read queries.csv with python #2847

Open LBeaudoux opened 3 years ago

LBeaudoux commented 3 years ago

When I try to read the latest queries.csv file with Python, multiple encoding-related errors are raised. I did not have this problem with the previous version of the file.

>>> with open("queries.csv", "r", encoding="utf-8") as f:
...     cnt_errors = 0
...     while True:
...         try:
...             line = next(f)
...         except StopIteration:
...             break
...         except UnicodeDecodeError:
...             cnt_errors += 1
... 

>>> print(cnt_errors)
2767
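As an aside, the faulty lines can also be counted in a single pass without a try/except loop by decoding with errors="replace", which maps invalid bytes to U+FFFD instead of raising UnicodeDecodeError. A minimal sketch on a hypothetical sample file (not the real queries.csv):

```python
# Sketch: \xe9 is Latin-1 "é", which is invalid as UTF-8, so the second
# line is undecodable. errors="replace" substitutes U+FFFD for bad bytes.
sample = b"bonjour\nfr\xe9quemment\nhello\n"
with open("sample.csv", "wb") as f:
    f.write(sample)

with open("sample.csv", "r", encoding="utf-8", errors="replace") as f:
    bad_lines = [line for line in f if "\ufffd" in line]

print(len(bad_lines))  # -> 1
```

This also explains mojibake like the "fr�quemment" row above: the log mixes UTF-8 with bytes from another encoding.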

In addition, some queries are not processed correctly by the sed command, like the one at line 21777: [Sun Apr 7 01:16:41.523 2019] 0.002 sec 0.002 sec [ext2/4/ext 0 (0,10)] [fra_main_index,fra_delta_index] fr�quemment

Until we have a better parser, would it be possible to share the source query.log file?

LBeaudoux commented 3 years ago

In the more recent 14-Sep-2021 01:12 version of queries.csv.bz2, I only found one faulty row at line 10665247: [Mon Nov 4 Nov 2019,eng,contradictory

However, the UnicodeDecodeError issue remains. Interestingly, when I open queries.csv with VS Code, I get the following modal:

[Screenshot from 2021-09-12 22-08-30: VS Code modal warning about unusual line terminators]

And after clicking on Remove Unusual Line Terminators, I can read the file without raising exceptions.
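For reference, the "unusual line terminators" VS Code warns about are U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR). A sketch of stripping them before parsing, on a toy string rather than the real file:

```python
# Sketch: remove the Unicode separators that VS Code flags as
# "unusual line terminators" (toy data, not the real queries.csv).
UNUSUAL = ("\u2028", "\u2029")  # LINE SEPARATOR, PARAGRAPH SEPARATOR

def remove_unusual_terminators(text: str) -> str:
    for ch in UNUSUAL:
        text = text.replace(ch, "")
    return text

raw = "fra,bonjour\u2028\neng,hello\u2029\n"
cleaned = remove_unusual_terminators(raw)
print(cleaned)  # -> "fra,bonjour\neng,hello\n"
```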

I also noticed that since the search refactoring of last year, queries are not systematically duplicated in the log. However, some queries still appear several times in a row or separated by a few lines. I suppose that a new entry is recorded when a user changes page. Is it possible to track the pages visited during a single search in query.log? If so, having the source file would help me a lot to eliminate duplicates.

jiru commented 3 years ago

In the more recent 14-Sep-2021 01:12 version of queries.csv.bz2, I only found one faulty row at line 10665247: [Mon Nov 4 Nov 2019,eng,contradictory

I looked into this. The original file has the same error, so I am not sure what to do about it. My guess is that the search daemon was stopped in the middle of a log write, or we ran out of disk space, or something like that.

However, the UnicodeDecodeError issue remains.

It is caused by clients sending invalid queries. For the sake of transparency, and because it’s easier, I think it’s better to leave them in the log. These queries shouldn’t return any result anyway, so you can safely ignore them by catching the exception.

I also noticed that since the search refactoring of last year, queries are not systematically duplicated in the log.

Can you give an example please?

However, some queries still appear several times in a row or separated by a few lines. I suppose that a new entry is recorded when a user changes page.

Your assumption is wrong. Each query is duplicated because the search engine is queried twice: once for the total number of results and once for the list of results. To my knowledge, this happens on every page.

Until we have a better parser, would it be possible to share the source query.log file?

No, sorry.

LBeaudoux commented 3 years ago

@jiru , thanks for your explanation.

Can you give an example please?

Until May 17, 2020, each query appears twice, mostly on consecutive lines:

17 May 2020,epo,fajro|fajron|fajroj|fajrojn
17 May 2020,epo,fajro|fajron|fajroj|fajrojn
17 May 2020,jpn,v
17 May 2020,jpn,v
17 May 2020,jpn,自然
17 May 2020,jpn,自然
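Consecutive duplicates like these can be collapsed with itertools.groupby, which merges runs of identical adjacent rows. A minimal sketch on the rows above:

```python
from itertools import groupby

rows = [
    ("17 May 2020", "epo", "fajro|fajron|fajroj|fajrojn"),
    ("17 May 2020", "epo", "fajro|fajron|fajroj|fajrojn"),
    ("17 May 2020", "jpn", "v"),
    ("17 May 2020", "jpn", "v"),
    ("17 May 2020", "jpn", "自然"),
    ("17 May 2020", "jpn", "自然"),
]

# groupby only merges adjacent identical rows, so duplicates separated
# by other lines are deliberately kept.
deduped = [row for row, _ in groupby(rows)]
print(len(deduped))  # -> 3
```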

But from line 19,234,247 of queries.csv (2021-09-14 version), the consecutive duplicates disappear most of the time:

17 May 2020,eng,=oh
17 May 2020,eng,=night
17 May 2020,eng,=right
17 May 2020,eng,=three
17 May 2020,eng,=against
17 May 2020,eng,=oh

I counted the daily occurrences of the queries after this changeover date with the help of the following statement:

WITH queries_daily_occurrences AS (
    SELECT
        date,
        language,
        content,
        count(*) AS daily_occurrences
    FROM queries
    WHERE date > '2020-05-17'
    GROUP BY date, language, content
)
SELECT daily_occurrences, count(*) AS nb_queries
FROM queries_daily_occurrences
GROUP BY daily_occurrences
ORDER BY daily_occurrences
LIMIT 6
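As an aside, the same two-level aggregation can be sketched without SQL using collections.Counter, shown here on toy rows standing in for the queries table:

```python
from collections import Counter

# Toy (date, language, content) rows, not the real dataset
rows = [
    ("2020-05-18", "eng", "=oh"),
    ("2020-05-18", "eng", "=oh"),
    ("2020-05-18", "eng", "=night"),
    ("2020-05-19", "eng", "=oh"),
]

daily_occurrences = Counter(rows)  # count per (date, language, content)
distribution = Counter(daily_occurrences.values())  # nb_queries per count
print(sorted(distribution.items()))  # -> [(1, 2), (2, 1)]
```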

As the browsing of the file suggested, this statement returns a vast majority of unduplicated queries:

daily_occurrences    nb_queries
1    3,458,284
2    533,374
3    111,334
4    43,298
5    19,599
6    11,143

The counts are very different for the queries prior to the switchover date:

daily_occurrences    nb_queries
1    26
2    3,481,622
3    11
4    486,567
5    5