DIGITALCRIMINAL / ArchivedUltimaScraper

Scrape content from OnlyFans and Fansly
GNU General Public License v3.0

Bringing Back Changes #69

Closed aboredpervert closed 1 year ago

aboredpervert commented 2 years ago

I have accumulated some changes on your code over time and would like to merge back as much as possible so it can be easier to update my fork when you make changes. Before I start making PRs I will list what I want to merge and why and you can tell me which ones you want.

  1. Usability. This is the result of trying to help non-technical people set up the scraper this year:
  2. Fixes
  3. Performance
  4. Scraping

Anyway, that's all. Tell me what of this you'd like to see as a PR.

OFfriend commented 2 years ago

Some models post hot stuff that expires after 1 month. Checking only expiredAt isn't enough.

Better to check post.linkedPosts, post.linkedUsers, post.hasUrl, post.mentionedUsers and post.expiredAt.

My JavaScript code:

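const _ = require('lodash');

function isPromo(post) {
  // Posts without text cannot be classified as promotions.
  if (_.isNil(post.rawText)) {
    return false;
  }
  const expires = !_.isNull(post.expiredAt);
  const hasLinkedUsers = !_.isEmpty(post.linkedUsers);
  const hasMentions = !_.isEmpty(post.mentionedUsers);
  // Promo if it links a free trial, or if it expires and points at other users.
  return post.rawText.includes('onlyfans.com/action/trial/')
    || (!_.isEmpty(post.linkedPosts) && hasLinkedUsers && post.hasUrl && expires)
    || ((hasLinkedUsers || hasMentions) && expires);
}

aboredpervert commented 2 years ago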

Thanks, this is very clever and much more accurate. I may take some inspiration from it.

There is still the issue that the presence or absence of timed posts can be used to infer who scraped the profile, though, so independently of filtering spam, a toggle for timed posts is still required.
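Since the scraper itself is written in Python, the heuristic above and the separate timed-post toggle could be sketched roughly as follows. This is only an illustration: the field names are taken from the JavaScript snippet, posts are assumed to be plain dictionaries, and `should_keep`, `skip_promos` and `include_timed_posts` are hypothetical names, not existing options of the scraper. Unlike the JavaScript version, a missing `expiredAt` key is treated here the same as a null one.

```python
from typing import Any, Mapping


def is_promo(post: Mapping[str, Any]) -> bool:
    """Rough Python port of the isPromo heuristic (field names from that snippet)."""
    raw_text = post.get("rawText")
    if raw_text is None:
        return False
    expires = post.get("expiredAt") is not None
    linked_posts = bool(post.get("linkedPosts"))
    linked_users = bool(post.get("linkedUsers"))
    mentioned_users = bool(post.get("mentionedUsers"))
    has_url = bool(post.get("hasUrl"))
    return (
        "onlyfans.com/action/trial/" in raw_text
        or (linked_posts and linked_users and has_url and expires)
        or ((linked_users or mentioned_users) and expires)
    )


def should_keep(
    post: Mapping[str, Any], *, skip_promos: bool, include_timed_posts: bool
) -> bool:
    """Apply two independent, hypothetical settings to one post."""
    # The timed-post toggle is checked on its own, regardless of promo filtering.
    if not include_timed_posts and post.get("expiredAt") is not None:
        return False
    if skip_promos and is_promo(post):
        return False
    return True
```

Keeping the two checks separate means a user can drop every expiring post without enabling the promo filter, and the other way around.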

americanseeder1865 commented 2 years ago

Made it ask fewer questions on startup (no site/media choice) in aboredpervert@4c1414a. Surprisingly, those extra questions confused people, so now it defaults to OnlyFans+all media, which is what you want all the time anyway.

What if a user wants to download from another site? I agree that the UI needs to be updated, but having a user edit a file to get the site they want is not ideal.

Made it open the config/auth files in a text editor (default to Notepad or vi) when they have changed in aboredpervert@f7b531e. When it asked people to edit them manually, a lot of them had trouble finding them.

I love this. The only problem I see is that we should figure out a way to not use JSON for this purpose only. Non-technical users don't care about spacing/formatting in a JSON file.
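As an illustration of the editor-launching idea discussed above, here is a minimal sketch. The Notepad/vi defaults come from the description of the commit; honoring the `VISUAL` and `EDITOR` environment variables first is an assumption added here, and the function names are hypothetical rather than taken from the fork.

```python
import os
import subprocess
import sys
from pathlib import Path
from typing import List, Mapping, Optional


def editor_command(
    path: Path,
    platform: str = sys.platform,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Build the command line used to open `path` in a text editor."""
    env = os.environ if env is None else env
    # Assumption: respect the user's preferred editor when one is configured.
    # An editor value that carries its own arguments is not handled here.
    editor = env.get("VISUAL") or env.get("EDITOR")
    if not editor:
        editor = "notepad" if platform.startswith("win") else "vi"
    return [editor, str(path)]


def open_in_editor(path: Path) -> None:
    """Block until the user closes the editor, then return."""
    subprocess.run(editor_command(path), check=False)
```

Splitting command construction from the actual launch keeps the platform logic easy to check without opening an editor.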

The big one: did all the work to be able to create a standalone Windows EXE using PyInstaller in aboredpervert@d913975 (+ extra fixes in aboredpervert@105b157). Completely solves the problem of Windows users being unable to set up Python + all the pip dependencies correctly.

This is awesome too! Before we consider merging this, I would like to make sure that it works on Linux and Docker.

We had already discussed it, it was merged once, then reverted because it suddenly changed some file names for people. But now that OFRenamer exists, that is not a problem anymore. Linux should consider that it can always allow long paths (aboredpervert@054cfad) even if post text in filenames is indeed a bad idea (and useless).

I would like to make it so that if a user changes file name formats in config.json, the directories would change with it. I am working on a tool separate from this project because the program I use to view stuff has a different way of doing things.

We may not agree on this one. I have removed all the extra HEAD requests, at the expense of progress bar accuracy in aboredpervert@8affdf5. Maybe this could be a configuration setting like "fast download" or something?

Instead of a progress bar for every method that is being used for the model, we should do a total progress bar that also displays download speeds in the appropriate bits/s. We can also go down the path of increasing/decreasing verbosity.

Stopped it from scraping sent PMs in aboredpervert@8006a5b. Very annoying, you don't want to see your own stuff in your rip.

This is annoying, but we should also give the user the ability to do what they want. Sounds like a config setting!

aboredpervert commented 2 years ago

What if a user wants to download from another site? I agree that the UI needs to be updated, but having a user edit a file to get the site they want is not ideal.

When I made the change initially, there was no Fansly support yet. And I have never ever seen someone try to use it for something other than OnlyFans yet. Maybe as Fansly grows that will change.

I love this. The only problem I see is that we should figure out a way to not use JSON for this purpose only. Non-technical users don't care about spacing/formatting in a JSON file.

I agree, but I was not trying to make a big change, just to fix one thing. Ideally the config file should use a format more adapted for configuration like TOML. That would allow having comments explaining the options too, instead of having them in the README.

This is awesome too! Before we consider merging this, I would like to make sure that it works on Linux and Docker.

You mean that the Windows EXE would be built from Linux? I'd have to check if PyInstaller supports this (hopefully). For now I have been building from a Windows VM. I have no experience in Docker however.

I would like to make it so that if a user changes file name formats in config.json, the directories would change with it. I am working on a tool separate from this project because the program I use to view stuff has a different way of doing things.

I thought that was what OFRenamer did? The scraper writes the paths of the media files it saved to the database, and if the format changes, it moves everything into place. Or does it only rename files, not handle directories?

Instead of a progress bar for every method that is being used for the model, we should do a total progress bar that also displays download speeds in the appropriate bits/s. We can also go down the path of increasing/decreasing verbosity.

Yes, this again would require deep changes, and my concern was not about verbosity. For every GET request the scraper makes, it first sends a HEAD request just to get the size for a fancy progress bar (and to check whether the file was partially downloaded, which downloading to a temporary file makes unnecessary). This request pattern is very easy to spot in web server logs and could be used to identify (and block?) those using the scraper.

This is annoying, but we should also give the user the ability to do what they want. Sounds like a config setting!

I can't imagine a use case where someone would want to save what they just sent (don't they still have a local copy?), but you are still right.

americanseeder1865 commented 2 years ago

You mean that the Windows EXE would be built from Linux? I'd have to check if PyInstaller supports this (hopefully). For now I have been building from a Windows VM. I have no experience in Docker however.

I want to make sure that the changes you have made won't break things on Docker and Linux.

aboredpervert commented 2 years ago

I want to make sure that the changes you have made won't break things on Docker and Linux.

That should not be a problem. I run it on Linux myself, so if I had broken something I would know.

americanseeder1865 commented 2 years ago

Well awesome then, you should definitely do a PR for that now. That would be extremely helpful. If you can in your PR.

aboredpervert commented 2 years ago

Well awesome then, you should definitely do a PR for that now. That would be extremely helpful. If you can in your PR.

I have a branch ready, but it will need to wait a little. PyInstaller still has issues with 3.10, so it does not work yet.

DIGITALCRIMINAL commented 2 years ago

Hi, thanks for the potential commits. Sorry, I was away doing stuff for weeks, so I couldn't respond properly.

  • Made it ask fewer questions on startup (no site/media choice) in aboredpervert@4c1414a. Surprisingly, those extra questions confused people, so now it defaults to OnlyFans+all media, which is what you want all the time anyway.

I know people mainly use this script to download from OF, but it's set up in a way that you should open the configuration and apply the settings yourself to automate it.

  • Changed the default filename format so it will make proper filenames that sort by post order in aboredpervert@7f84286. People don't bother to customize it, and it makes for unorganized rips. Yes, the scraper will set the file times by default, but that is often not preserved across many file transfer methods.

It should be left to the user whether to include dates in their filenames. I chose to keep the filenames as close to 1:1 as possible so that users can paste them into Google or custom databases that look up models.

  • Stopped it from scraping sent PMs in aboredpervert@8006a5b. Very annoying, you don't want to see your own stuff in your rip.

Some models use the script to archive their entire content, including stuff they've posted, so this should be an optional choice.
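An optional setting for sent messages could look roughly like the sketch below. It is only an illustration: messages are assumed to be dictionaries with a `fromUser.id` field, and `filter_messages` and `include_sent` are hypothetical names, not part of the scraper.

```python
from typing import Any, Iterable, List, Mapping


def filter_messages(
    messages: Iterable[Mapping[str, Any]],
    own_user_id: int,
    include_sent: bool = False,
) -> List[Mapping[str, Any]]:
    """Drop messages sent by the logged-in account unless include_sent is set."""
    if include_sent:
        # Archiving use case: keep everything, including our own messages.
        return list(messages)
    return [
        message
        for message in messages
        # Assumed field layout: the sender's id sits under fromUser.id.
        if message.get("fromUser", {}).get("id") != own_user_id
    ]
```

Defaulting `include_sent` to `False` matches the fork's behavior, while a model archiving her own account could switch it on.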

  • Made it open the config/auth files in a text editor (default to Notepad or vi) when they have changed in aboredpervert@f7b531e. When it asked people to edit them manually, a lot of them had trouble finding them.
  • The big one: did all the work to be able to create a standalone Windows EXE using PyInstaller in aboredpervert@d913975 (+ extra fixes in aboredpervert@105b157). Completely solves the problem of Windows users being unable to set up Python + all the pip dependencies correctly.
  • We had already discussed it, it was merged once, then reverted because it suddenly changed some file names for people. But now that OFRenamer exists, that is not a problem anymore. Linux should consider that it can always allow long paths (aboredpervert@054cfad) even if post text in filenames is indeed a bad idea (and useless).
  • Another thing we had discussed before: downloading files to temporary names and renaming to the final name on successful download (aboredpervert@fbba181). Every browser does it already (Firefox & Chrome), and a lot of other scrapers (youtube-dl, Instaloader) also do it. It makes things simpler (no need to check file sizes later to see if it was a partial download) and also allows file sync/indexing tools to ignore them, thumbnailers to avoid trying to process incomplete images/videos, etc. This time I have written it so that it does the least amount of extra I/O possible: on the successful path, there is only one extra rename.
  • The quick and dirty fix for async issues from "Error with 7.5.1 release" (#44) on Windows in aboredpervert@7414cfb.

These are fine to commit.
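The temporary-name download pattern from the list above can be sketched as follows. This is a minimal illustration, not the code from aboredpervert@fbba181: `save_atomically` and the `.part` suffix are hypothetical choices, and the body is assumed to arrive as an iterable of byte chunks.

```python
import contextlib
import os
import tempfile
from pathlib import Path
from typing import Iterable


def save_atomically(chunks: Iterable[bytes], final_path: Path) -> Path:
    """Write to a temporary file, then rename it to `final_path` on success."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        # The only extra I/O on the successful path: a single rename.
        os.replace(tmp_name, final_path)
    except BaseException:
        # A failed or interrupted download never leaves a partial final file.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    return final_path
```

Because the final name only ever appears after a complete write, a later run can treat any existing file as finished without comparing sizes.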

  • We may not agree on this one. I have removed all the extra HEAD requests, at the expense of progress bar accuracy in aboredpervert@8affdf5. Maybe this could be a configuration setting like "fast download" or something?

Yeah, this is good, but it should be implemented as an optional setting. I can do this.

You can do the OF commits, I will just replicate it to all the other site modules.

I'm also redesigning the UI in its entirety with https://github.com/willmcgugan/rich so it can be more user-friendly.