PeskyPotato / archive-chan

Download threads from 4chan including media
MIT License

Keep checking threads until they're either archived or 404 #2

Open cardoso-neto opened 3 years ago

cardoso-neto commented 3 years ago

From what I could see, archive-chan currently only downloads snapshots of threads instead of "watching" them for new posts until completion. I'm thinking we could add a --watch-threads flag or something like that. I would gladly implement this. Your archiver is the most complete I've found so far; I would just like to discuss this with you, as I'm not sure how to do it yet.

PeskyPotato commented 3 years ago

The flag seems good. I was thinking about setting up a timer in feeder() to call archive() on the thread URL repeatedly. We can then check whether the image has already been written to the thread directory and skip it if it has, to save bandwidth. As for the text, I was thinking of just overwriting the HTML file on each iteration.

Not sure if there's a better way of doing it, if you have any idea let me know.
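
A minimal sketch of that polling idea (the watch_thread name, the interval, and archive() returning the thread's state are all assumptions here, not the actual archive-chan API):

import time

def watch_thread(url, params, interval=60):
    # Re-archive the thread periodically until it is archived
    # or 404s; assumes archive() reports that state somehow.
    while True:
        status = archive(url, params)
        if status in ("archived", "404"):
            break
        time.sleep(interval)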

cardoso-neto commented 3 years ago

Sounds pretty good. Might I recommend checking if the media file is already downloaded with:

from pathlib import Path

media_file_path = Path(f"{reply['filename']}{reply['ext']}")
# Skip the download entirely if the file already exists on disk.
if params.preserve and not media_file_path.is_file():
    # download

cardoso-neto commented 3 years ago

As for the text, I was thinking of just overwriting the HTML file on each iteration.

What if a mod deleted one of the posts? Would we now lose that post? That would be undesirable.
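
One way to avoid that loss (my own sketch, nothing that exists in archive-chan yet) would be to merge each new snapshot into the stored thread by post number instead of overwriting it:

def merge_snapshots(old_posts, new_posts):
    # Index both snapshots by post number ("no" in the 4chan API).
    merged = {post["no"]: post for post in old_posts}
    # Newer data wins for posts that still exist, while posts
    # deleted from the live thread stay in the archive.
    merged.update({post["no"]: post for post in new_posts})
    return [merged[no] for no in sorted(merged)]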

cardoso-neto commented 3 years ago

I thought a bit more about my suggested way of checking whether a file has already been downloaded. Since it only checks that the file path exists, we could end up with corrupt or incomplete files. I'm thinking we need a way to check that the whole file has been downloaded; comparing file sizes is the first thing that comes to mind.

So here is what I did to the Extractor.download method:

# Requires: import sys, import requests, and from pathlib import Path.
def download(self, path, name, params, retries=0):
    """
    Download the file at `path` and save it as `name`.

    If the request fails, retry until the total retry count is reached.
    """
    file_path = Path(params.path_to_download) / name
    try:
        if file_path.is_file():
            # Ask the server for the file's size without downloading it,
            # so partially written files get fetched again.
            response = requests.head(path, timeout=10)
            # Content-Length is a string header; cast it before comparing.
            size_on_the_server = int(response.headers.get("content-length", 0))
            if file_path.stat().st_size == size_on_the_server:
                return
        if params.verbose:
            print("Downloading image:", path, name)
        response = requests.get(path, timeout=240)
        with open(file_path, "wb") as output:
            output.write(response.content)
    except Exception as e:
        if params.total_retries > retries:
            print(e, file=sys.stderr)
            print(f"Retry #{retries + 1}")
            retries += 1
            self.download(path, name, params, retries)

I'm still not sure if this is completely safe, though.
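
For what it's worth, a common way to make the write itself safer (my suggestion, not something in the codebase) is to download to a temporary name and only rename on success; os.replace is atomic on the same filesystem, so an interrupted download never leaves a truncated file under the final name:

import os

def safe_write(file_path, content):
    # Write to a sibling ".part" file first.
    tmp_path = file_path.with_name(file_path.name + ".part")
    with open(tmp_path, "wb") as output:
        output.write(content)
    # Atomically move it into place once the write has finished.
    os.replace(tmp_path, file_path)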

cardoso-neto commented 3 years ago

So, huge bug found: because we were saving media files under their original file names, any time a thread had more than one media file with the same name, only the most recent one survived. This is easily solvable.

We could go for one of the following (or maybe make it configurable?):

a. name each file after its post id;
b. keep the original file name, plus something unique;
c. name each file after its hash.

What do you think?

PeskyPotato commented 3 years ago

I think the hash would probably be the best option since we get it from the 4chan API and it's included in the Reply model. That way we can keep the original file name and just compare the hash returned by the API with the file that exists in the folder. What do you think @cardoso-neto?

cardoso-neto commented 3 years ago

I think the hash would probably be the best option

Here it looks like you want option c.

we can keep the original file name

But here it looks like you want option b.

just compare the hash returned by the API with the file that exists in the folder

And here it looks like you want option c again. :confused:

If there's any chance we can reproduce those md5 checksums, then option c is definitely the best, because it'd solve the redownload problem perfectly. This is what they look like: MJoiDDK2ehvXP3fvM1wdAw==, which kinda looks like base64 encoding to me. If we can't, then going with option b seems like a good enough choice, since it'd be the best of both worlds (uniqueness as well as readability). I'll experiment a bit with it and get back to you.

I already have a fork where I'm working on stuff, btw: cardoso-neto/archive-chan
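
That base64 guess is easy to check with the standard library, by the way; the string decodes to exactly 16 bytes, which is the size of an MD5 digest:

import base64

digest = base64.b64decode("MJoiDDK2ehvXP3fvM1wdAw==")
print(len(digest))   # 16, the length of an MD5 digest
print(digest.hex())  # the same checksum in the usual hex notation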

cardoso-neto commented 3 years ago

So, I managed to reproduce 4chan's base64-encoded binary md5 hash with openssl md5 -binary $filename | openssl base64. The next step is choosing how to deal with the filenames. I'm thinking we could just name files after 4chan's standard post id and save the original .json file from 4chan's API, so we'd also keep the original filenames and hashes. Maybe something like this for the folder structure:

board/
    thread_id/
        media/
            post-id.png
        index.html
        thread.json
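
The same check in Python, in case it gets wired into the downloader (the matches_api_md5 name is mine, not archive-chan's):

import base64
import hashlib
from pathlib import Path

def matches_api_md5(file_path: Path, api_md5: str) -> bool:
    # The 4chan API ships each file's MD5 as base64-encoded binary,
    # so hash the local file the same way before comparing.
    digest = hashlib.md5(file_path.read_bytes()).digest()
    return base64.b64encode(digest).decode("ascii") == api_md5
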
cardoso-neto commented 3 years ago

Somewhat related to this issue, I created two new command-line switches: --archived and --archived_only. They're meant to be used when supplying a board letter (like /mlp/), so you can download the threads on /mlp/archive/ as well. I couldn't branch off of master because it didn't have my "retrying requests session" (which retries timed-out requests), so I branched off of my own branch. This is the commit https://github.com/cardoso-neto/archive-chan/commit/ad94c40abeecd884a175943d0f7cb36b4d034d1d if you feel like doing a code review.
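
For readers following along, the new switches would look roughly like this with argparse (the help strings are my reading of the comment above, not the actual code):

import argparse

parser = argparse.ArgumentParser(prog="archive-chan")
parser.add_argument("--archived", action="store_true",
                    help="also fetch threads from the board's archive")
parser.add_argument("--archived_only", action="store_true",
                    help="fetch only the board's archived threads")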

PeskyPotato commented 3 years ago

Thank you @cardoso-neto, I will take a look at the commit this week. I appreciate the time you've put into this.

:slightly_smiling_face: