AmenRa / retriv

A Python Search Engine for Humans 🥸
MIT License
174 stars 20 forks source link

[BUG] Corrupted log when using SearchEngine #24

Closed Ch41r05 closed 1 year ago

Ch41r05 commented 1 year ago

Hi again,

I'm stuck with a strange behavior, that from my tests seems to be related to the use of the SearchEngine. I'm using a SingletonLogger that logs everything to stdout and persists that log onto a file. When the program runs, the index takes a bit of time to be calculated and if I check the logfile, I can correctly see everything printed until this point. After the SearchEngine finishes calculating the index, the first row of the logfile becomes a series of nul values. Below a sample of code and of the log file. Can anyone give me pointers to solve this?

_logger = Logger()

[code doing stuff, collecting collection mainly]

_logger.info("Building index...")
SearchEngine("new-index").index(collection, show_progress=False)
_logger.info("Index built.")

logfile.log

AmenRa commented 1 year ago

Could you provide a reproducible snippet?

Ch41r05 commented 1 year ago

Hi @AmenRa ,

Here is a script that reproduces the error. The logger class only has this problem when SearchEngine().index() is run, if commented the logs are correctly printed.

I may have done something wrong, but I can't seem to find the cause for this since every other test I've run didn't reproduce the issue.

binary_log.zip

AmenRa commented 1 year ago

Hi, i removed from commons.logger import Logger as I do not know what commons is.

Everything seems to work fine on my end. I get this in both my terminal and the log file:

[2023-09-02 09:16:28,031] {binary_log.py:55} INFO - Building index...
[2023-09-02 09:16:28,303] {binary_log.py:57} INFO - Index built.
Ch41r05 commented 1 year ago

@AmenRa I checked and confirmed that even without the import the console log works fine but the log file still has the problem, I tested this script on two different PCs which both run windows. Maybe this could be something related to the operating system?

Also, can I ask you your configuration? Thing like OS, Python version, and the such. Thanks a lot

Ch41r05 commented 1 year ago

Hi @AmenRa I did some further testing and I found something interesting. If I change the order of the lines

 logger.info("Building index...")
 SearchEngine("new-index").index(collection, show_progress=False)
 logger.info("Index built.")

to

 logger.info("Building index...")
 logger.info("Index built.")
 SearchEngine("new-index").index(collection, show_progress=False)

the log file becomes empty. So it seems from the tests I conducted that somehow rertiv is still sending something to stdout even when with show_progress=False. If I comment the indexing line, the log correctly is shown in the file. I'm running on windows 11 with python 3.10.0. I don't know if I can give you any more information, but please tell me if you need something else to try to reproduce the error in your environment.

Ch41r05 commented 1 year ago

Hi @AmenRa , quick update, I tried changing the logging library with loguru, and the issue still occurs when opening the file in W mode. From further digging, I found that the log file has NUL (yes, with just one "L") values before the last line, this seems to be an issue related to the underlying open file command and multithreading/multiprocessing which, given the high speed of retriv, I guess it's being done under the hood. I'll try to see if I can dig some other information.

AmenRa commented 1 year ago

Hi, show_progress is passed to a tqdm progress bar in this fashion:

tqdm(
  ...
  disable=not show_progress,
)

Try running a loop with tqdm like this to verify whether tqdm is (one of) the cause:

from tqdm import tqdm

logger.info("Building index...")
for _ in tqdm(range(10_000), disable=True):
    continue
logger.info("Index built.")

Other tqdm options that I use are:

desc="something"
dynamic_ncols=True
mininterval=0.5

Honestly, I am not aware of other things that may interfere with a logger. Also, I do not have a Windows machine, so maybe the issue is related to Windows / Windows + tqdm.

Ch41r05 commented 1 year ago

Hi @AmenRa , thanks for your help! I ran the sample code you provided me, sadly (?) everything worked fine, the output was the expected one in both the console and the logfile. Seems tqdm isn't the culprit here. I agree that the problem may lie in the fact that I'm using Windows. As I was saying this kind of behavior seems to be linked on how files are opened under the hood. I'll try to dig deeper and keep you updated if I find anything interesting, since this behavior is strange and not as intended I guess!

Ch41r05 commented 1 year ago

Hi @AmenRa, thanks for your support, I finally solved the issue. The problem doesn't lie in retriv per se, but in the fact that's using multiprocessing. The logger was instantiated outside the if name == "main" section of the code, so the multiprocessing executed it more than once, leading to unexpected output. Once the instantiation of the code has been moved under the if, everything works as expected. Leaving this here so other windows users won't be caught in OS this difference.