fake-name / xA-Scraper


Exclude retweets from gen_tweets_for_date_span #76

Closed God-damnit-all closed 4 years ago

God-damnit-all commented 4 years ago

Excludes retweets from the search function using the filters documented here.

As for the timeline, while there's an option documented here to exclude retweets from the timeline, it's only visual for the end user; the retweets are still present in the JSON object that's returned.
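For reference, a minimal sketch of both halves of that (hypothetical helper names, not the PR's actual code; tweet dicts are assumed to be in the classic v1.1 shape, where retweets carry a "retweeted_status" key):

```python
# Hypothetical sketch, not the module's actual code: exclude retweets at
# query time with the documented "-filter:retweets" search operator, and
# defensively drop any that still show up in a timeline-style JSON payload.

def build_search_query(screen_name, since, until):
    return "from:{} since:{} until:{} -filter:retweets".format(
        screen_name, since, until)

def drop_retweets(tweets):
    # The "hide retweets" timeline option is purely cosmetic, so filter on
    # the returned JSON objects instead.
    return [t for t in tweets if "retweeted_status" not in t]
```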

God-damnit-all commented 4 years ago

The alternative is including retweets in the same folder as the user they were retrieved from, but that's going to make the filename scheme a bit muddy if you want the files to sort in order. You could lead with the tweet ID, but retweets carry their own date, which is what places them where they appear in the timeline. And then there's the issue of when someone retweets themselves...

Honestly, I don't think retweets are of much value. There are some instances where someone retweets another account that posted their work instead of posting it themselves, but more often than not it's completely unrelated stuff.

fake-name commented 4 years ago

I believe I already save retweets in a folder corresponding to the original tweeter's name. If not, that's a legitimate issue.

God-damnit-all commented 4 years ago

@fake-name My Twitter folder currently has 6692 subfolders, so I'm afraid something is not working right, yes.

fake-name commented 4 years ago

That sounds about right?

Basically, it creates a folder for the original tweeter, and saves the retweeted content there.

It's more of a timeline archiving tool, rather than a tweet archiving tool (though the former is technically a superset of the latter).



God-damnit-all commented 4 years ago

Why would that be useful? If it's someone retweeting their own art posted on someone else's account, it would make more sense for it to be in the folder of the person who retweeted it.

Putting each retweet in a separate folder named after the original poster has turned the directory structure into a labyrinth.

fake-name commented 4 years ago

I chose this approach because a lot of the people I'm interested in archiving retweet other people who have art that I'm also interested in, but haven't added yet. There's also a lot of retweeting of fan-art of their OCs by other people, and similar stuff.

I could archive retweets in a subdirectory of the original target artist that's being scraped, but that'd lead to duplicates if people RT each other: if two artists who are both scraped RT each other, you'd get two copies of each tweet.

God-damnit-all commented 4 years ago

I just addressed that. If you don't put the retweets in the folder of the person who retweeted it, no association is being made.

fake-name commented 4 years ago

no association is being made.

The database manages that part: https://github.com/fake-name/xA-Scraper/blob/master/xascraper/modules/twit/twitScrape.py#L114-L117

The tweets are stored per scraped target in the DB, but the backing store is basically tweet/retweet agnostic.
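Purely for illustration (this is NOT the schema from twitScrape.py; the linked lines are the source of truth), the association being described could look roughly like this in SQLAlchemy terms:

```python
# Hypothetical sketch only: each tweet row records which scrape target it was
# fetched for, so the file-level backing store never needs to know whether a
# given item came from a retweet.

from sqlalchemy import Boolean, Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class StoredTweet(Base):
    __tablename__ = "stored_tweets"              # illustrative table name
    id            = Column(Integer, primary_key=True)
    tweet_id      = Column(String, index=True)   # the tweet's own ID
    scraped_for   = Column(String, index=True)   # account being scraped
    original_user = Column(String, index=True)   # author of the content
    is_retweet    = Column(Boolean, default=False)
    media_path    = Column(String)               # wherever the file landed
```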

God-damnit-all commented 4 years ago

The database manages that part.

That is a very frustrating answer.

God-damnit-all commented 4 years ago

The tweets are stored per scraped target in the DB, but the backing store is basically tweet/retweet agnostic.

Call me crazy but I've always preferred browsing through files with a file manager, and this folder structure makes that very, very unpleasant. Could it please not stay like this?

fake-name commented 4 years ago

I don't have a better way of saving that sort of metadata. Saving RTs locally in the folder of the person who retweeted something would mean you'd get duplicates if you have multiple people RT the same content.

I'm super disk-space aware, so anything that produces duplicates is something I REALLY want to avoid. I'm already 70% of the way to convincing myself to rewrite the entire backing store in a way that would let me do fuzzy image deduplication (see the stuff I've already written for that).

Frankly, I'm super tempted to write a content-addressed store of some sort as a FUSE module or something. At that point, the resulting filesystem store would be COMPLETELY opaque and not user accessible, but I could do some really neat stuff with data deduplication.

Another motivation for that approach is a bunch of my projects basically have a backend that boils down to "store a bunch of images (with possible duplicates)", so improving that component would apply to a lot of my stuff.

Could it please not stay like this?

It is like this right now? But it has a lot of implications I don't like.

God-damnit-all commented 4 years ago

Are hard links not an option?

fake-name commented 4 years ago

I don't like them? Also any tools I'd then use for backup would need to be aware of them?

Also, I run my scraping systems on a different system than my disk storage, so it makes things like network filesystems another fiddly failure point (though I'm mostly NFS based ATM, so that'd probably not be a major issue).

God-damnit-all commented 4 years ago

I don't like them? Also any tools I'd then use for backup would need to be aware of them?

Creating an OS-specific virtual file system via FUSE seems a hell of a lot more complicated.

It is like this right now? But it has a lot of implications I don't like.

I know, I was pleading with you for it to not remain like this.

fake-name commented 4 years ago

Creating an OS-specific virtual file system via FUSE seems a hell of a lot more complicated.

Yeah, but it'd be to accomplish something a hell of a lot more complicated (content-addressable storage by perceptual hash).

Also, the actual FS layout can be super dumb; it's just the translation layer that'd need to be complex. It'd likely not actually be FUSE based, but rather a CRUD-like web service that makes content accessible, with an opaque underlying content store (I've used storage-by-sha1 elsewhere with success).
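As a rough sketch of the storage-by-sha1 idea (hypothetical paths, and obviously not the web-service wrapper itself):

```python
# Rough sketch of storage-by-SHA1: content is addressed purely by its hash,
# so identical byte strings dedup for free and the on-disk layout stays
# opaque to the user. The store root here is a hypothetical example.

import hashlib
import os

STORE_ROOT = "/srv/content-store"

def _path_for(digest):
    # Fan out into aa/bb/ subdirectories so no single directory gets huge.
    return os.path.join(STORE_ROOT, digest[:2], digest[2:4], digest)

def put(data: bytes) -> str:
    digest = hashlib.sha1(data).hexdigest()
    path = _path_for(digest)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):      # identical content is stored only once
        with open(path, "wb") as fp:
            fp.write(data)
    return digest

def get(digest: str) -> bytes:
    with open(_path_for(digest), "rb") as fp:
        return fp.read()
```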

Also, something FUSE based sounds like a fun project, which isn't a great reason, but this is hobby shit.

God-damnit-all commented 4 years ago

The problem with that approach is that files can have the exact same image data but different SHA1 hashes due to minor changes in their metadata. You'd have to come up with a new image-saving process that takes the image data of what you downloaded and recreates the file.

fake-name commented 4 years ago

You'd have to come up with a new image-saving process that takes the image data of what you downloaded and recreates the file.

content-addressable storage by perceptual hash

That's literally what I'm talking about. Did you see the project of mine which exists specifically to speed up fuzzy image searching? Or the project that does fuzzy image deduplication within compressed archives?
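To make the distinction concrete (this uses the third-party imagehash package, not code from either of those projects): two files with identical pixels but different metadata hash differently under SHA1, while their perceptual hashes stay within a small Hamming distance.

```python
# Illustrative only: SHA1 keys on the raw file bytes, so re-saved metadata
# changes the hash; a perceptual hash (pHash here, via the "imagehash"
# package) keys on the decoded pixels instead.

import hashlib
import imagehash
from PIL import Image

def file_sha1(path):
    with open(path, "rb") as fp:
        return hashlib.sha1(fp.read()).hexdigest()

def looks_like_duplicate(path_a, path_b, max_distance=4):
    h_a = imagehash.phash(Image.open(path_a))
    h_b = imagehash.phash(Image.open(path_b))
    # Subtracting two imagehash values gives their Hamming distance.
    return (h_a - h_b) <= max_distance
```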

God-damnit-all commented 4 years ago

Well sorry, it's the first time I've heard the phrase 'perceptual hash'.

fake-name commented 4 years ago

Oh, lol, np.

I've probably spent far too much time thinking about this particular issue.

God-damnit-all commented 4 years ago

Could a distinction between users I'm following and not following be made for the folder structure? Within the twitter folder, a Following folder and a Retweeted folder, perhaps?

Then perhaps the web interface could get a check so that when you add someone whose tweets are in Retweeted, it moves them into the Following folder instead.

I don't know. I just want to browse files with a file manager. I've never been overly fond of using web interfaces for everything.

God-damnit-all commented 4 years ago

That said, in regards to this perceptual hash route you want to go with, it sounds like you'd get a lot of benefit from the Hydrus Network.

God-damnit-all commented 4 years ago

Okay, actually I think I may have a better solution. Within the Twitter folder, have there be a .tweets folder, and then just have each followed artist be a symbolic link to their folder inside it. Because it's an 'inner' symbolic link, relative pathing should work on every OS.
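In rough terms, something like this (hypothetical paths, just to show the relative-link idea):

```python
# Sketch of the suggested layout: all per-user folders live under
# twitter/.tweets/, and each followed user gets a *relative* symlink beside
# it, so the links keep working if the whole tree is moved.

import os

def link_followed_user(twitter_root, username):
    target = os.path.join(".tweets", username)          # relative target
    os.makedirs(os.path.join(twitter_root, target), exist_ok=True)
    link_name = os.path.join(twitter_root, username)
    if not os.path.lexists(link_name):
        # On Windows, creating symlinks needs developer mode or admin rights.
        os.symlink(target, link_name, target_is_directory=True)
```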

fake-name commented 4 years ago

Could a distinction between users I'm following and not following be made for the folder structure? Within the twitter folder, a Following folder and a Retweeted folder, perhaps?

You still have the duplicate files issue.

I don't know. I just want to browse files with a file manager. I've never been overly fond of using web interfaces for everything.

If you asked me about 8-10 years ago, I'd have wholeheartedly agreed.

Once I started writing my own viewing tooling, I basically use an actual fs browser only for debugging.

That said, in regards to this perceptual hash route you want to go with, it sounds like you'd get a lot of benefit from the Hydrus Network.

I'm aware of it, but they don't deal with the sort of scaling issues I have (25 million plus images).

There's not really enough "there" there to be worth forking, rather than just understanding the problem domain and re-writing.

fake-name commented 4 years ago

Okay, actually I think I may have a better solution. Within the Twitter folder, have there be a .tweets folder, and then just have each followed artist be a symbolic link to their folder inside it. Because it's an 'inner' symbolic link, relative pathing should work on every OS.

Frankly, that's a bunch of work for effectively negative benefit, from my perspective (more FS crap to maintain that I'd never use).

Personally, I'm not going to implement that. Sorry!

God-damnit-all commented 4 years ago

You still have the duplicate files issue.

I phrased it badly then. I simply meant having a folder for people I do follow and a folder for people I don't.

fake-name commented 4 years ago

I phrased it badly then. I simply meant having a folder for people I do follow and a folder for people I don't.

Even with that, if you scrape a bunch of people, and then add some of them, you'd have two folders for the same person (one of their stuff that got RTed, and one of everything).

An approach there would be just moving the directory, but it's still a bunch of work. I think the energy would be better spent on making the web stuff work better instead.

God-damnit-all commented 4 years ago

An approach there would be just moving the directory

That's what I meant, yes.

I think the energy would be better spent on making the web stuff work better anyways.

The web stuff barely even works, it's a long ways off from being pleasant to use.

You know what, I give up. I didn't like how you dismissed all the information I had regarding scraping Twitter (only to arrive at the implementation I suggested in the first place), but this is just too frustrating.

I don't regret contributing what little I could, but I need my collection to be navigable so I can actually enjoy it. Best of luck with your project.

fake-name commented 4 years ago

The web stuff barely even works, it's a long ways off from being pleasant to use.

I don't disagree. At this point, I basically just use the patreon module, personally, and maintain the rest mostly out of FOMO.

I won't pretend it's good, and I'm certainly open to improvements (or someone bugging me to make improvements).

You know what, I give up. I didn't like how you dismissed all the information I had regarding scraping Twitter (only to arrive at the implementation I suggested in the first place), but this is just too frustrating.

I did? I apologize. FWIW, the part I thought I was dismissing was the need to log in, and the rest was helpful (it's how I got where I did). It turns out you don't need to log in to do ranged searches, which is nice.

I don't regret contributing what little I could, but I need my collection to be navigable so I can actually enjoy it.

I've been where you are. The issue is that the general archiving problem doesn't map to folder layouts, because websites are generally too multidimensional. There's no way to lay out directories that doesn't wind up with either 1. duplicated files, or 2. missing data.

The reason I'm disinclined to accept things like https://github.com/fake-name/xA-Scraper/pull/72 (well, aside from the complexities of handling user-provided format strings) is that in my opinion, it's solving the wrong issue.

The way to handle this problem is to design a user interface that maps to the data, not to try to map the data to a rigid structure (folders). You basically have a complex dataset (tags, authors, date-ranges, source-sites, etc...) and all of them are valid ways to view things.

God-damnit-all commented 4 years ago

It turns out you don't need to log in to do ranged searches, which is nice.

The issue was actually that only safe results would be returned unless you were logged in, but it does seem you've found a way around that.

There's no way to lay out directories that doesn't wind up with either 1. duplicated files, or 2. missing data.

Hard links.

You basically have a complex dataset (tags, authors, date-ranges, source-sites, etc...) and all of them are valid ways to view things.

If you're really looking to reinvent the wheel, why not stick everything downloaded into an extensionless file with a filename based off its perceptual hash, and then generate symbolic links into subfolders as needed?

The way to handle this problem is to design a user interface that maps to the data, not to try to map the data to a rigid structure (folders).

There are a lot of drawbacks to using a web interface, the biggest among them being having to design one, but aside from that?

  1. GPU-acceleration of web interfaces is absolute garbage
  2. Cycling through lots of high-res images would be very slow due to the lack of preloading
  3. No support for loading PSD, SAI, CLIP, or other file formats that aren't the standard fare
  4. More difficult to quickly share an image (unless you expose the very insecure service to the web) due to how bitchy browsers are about where you can use file:/// URIs (you could upload from your clipboard but you can easily hit a file size limit on services like imgur that you wouldn't have otherwise)
  5. No ability to generate thumbnails without using a combination of php and imagemagick/graphicsmagick, which is just not as good as what the OS could do

With all these drawbacks in mind, it seems very strange to treat using a file manager to browse files as though it's past its prime.

fake-name commented 4 years ago

The issue was actually that only safe results would be returned unless you were logged in, but it does seem you've found a way around that.

I somehow didn't put that together, and immediately got distracted by RAEG about how much I dislike twitter and required registration. I really apologize about that, I handled it really poorly.


If you're really looking to reinvent the wheel, why not stick everything downloaded into an extensionless file with a filename based off its perceptual hash, and then generate symbolic links into subfolders as needed?

Don't tempt me (well, except the symlinks bit). Have you seen what I do in my ReadableWebProxy project?

  1. GPU-acceleration of web interfaces is absolute garbage
  2. Cycling through lots of high-res images would be very slow due to the lack of preloading

Who says you have no preloading? https://github.com/fake-name/HTML5-Comic-Book-Reader has too much preloading, actually. On my crappy tablet, it can cause the browser to OOM on HUGE (500 MB+) manga archives because there's a bug I've not been bothered to fix.

Also, this project was kind of my "come to jesus" experience for webshit. I used to insist on using hamana or cdisplay, but it's good enough that the web-view reader wound up completely displacing my use of filesystem-based viewers.

  3. No support for loading PSD, SAI, CLIP, or other file formats that aren't the standard fare

Is that an issue? Which site supports PSD?

  4. More difficult to quickly share an image (unless you expose the very insecure service to the web) due to how bitchy browsers are about where you can use file:/// URIs (you could upload from your clipboard but you can easily hit a file size limit on services like imgur that you wouldn't have otherwise)

Personally, I have literally never wanted to do this. I suspect this is an area where it's a consideration that'd never even enter my mind.

  5. No ability to generate thumbnails without using a combination of php and imagemagick/graphicsmagick, which is just not as good as what the OS could do

Uh, what? There are dozens of image processing libraries. I use Pillow, which is a robust image-manipulation library backed by a C extension.

Also, any image-manipulation library is going to do a better job than the OS compositor, because the OS compositor is generally tuned for speed, while you can manually specify higher-quality (but slower) scaling algorithms.
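As a minimal example of that point (hypothetical paths and sizes), Pillow will happily use a slower, higher-quality filter when downscaling:

```python
# Minimal Pillow sketch: pick a higher-quality resampling filter (LANCZOS)
# for thumbnails, which a speed-tuned OS compositor generally won't do.

from PIL import Image

def make_thumbnail(src_path, dst_path, max_px=512):
    with Image.open(src_path) as im:
        im = im.convert("RGB")                    # JPEG can't store alpha
        im.thumbnail((max_px, max_px), Image.LANCZOS)
        im.save(dst_path, "JPEG", quality=90)
```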

Really, I don't bother re-scaling server side. The manga reader just uses HTML5 canvas transforms (images are buffered at full size, and rendered to the viewer canvas at its display size).

An assumption I do make (and am not really interested in reconsidering) is that the viewing client and the server are on a LAN or similar high-bandwidth link.


Don't get me wrong, I really don't like javascript, or having to write javascript, but it's surprisingly not terrible if done decently.

Note that this particular project is done, well, horribly. Most of the web stuff was one of the first things I wrote when I was learning JS about 8 years ago.