Messages content not downloading in v0.4

swuckt commented 1 year ago

Today, testing on the same creator, I get different results with v0.3.5 versus v0.4

Scraper v0.3.5

Reported Total
986 Pictures
259 Videos
203 Duplicates declined

Breakdown of Totals
Messages: 97 Pictures, 45 Videos
Timeline: 614 Pictures, 136 Videos
Timeline Previews: 275 Pictures, 78 Videos

Downloader v0.4

Reported Total
890 Pictures
193 Videos
2 Duplicates declined

Breakdown of Totals
Messages: 34 Pictures, 24 Videos
Timeline: 808 Pictures, 139 Videos
Timeline Previews: 48 Pictures, 30 Videos

Only the Timeline download seems to be complete in v0.4 (the update to handle video served as .m3u8 is great).
v0.4 is finding less media than v0.3.5 in Messages and Timeline Previews.
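
A quick way to see where the two runs diverge is to subtract the reports section by section. The sketch below uses only the counts quoted in the two reports; the dictionary layout and the function name are made up for illustration.

```python
# Counts copied from the two reports: (pictures, videos) per section.
v035 = {"Messages": (97, 45), "Timeline": (614, 136), "Timeline Previews": (275, 78)}
v04 = {"Messages": (34, 24), "Timeline": (808, 139), "Timeline Previews": (48, 30)}


def section_deltas(old, new):
    """Return {section: (picture_delta, video_delta)}, computed as new minus old."""
    return {
        name: (new[name][0] - old[name][0], new[name][1] - old[name][1])
        for name in old
    }


deltas = section_deltas(v035, v04)
for name, (pictures, videos) in deltas.items():
    print(f"{name}: {pictures:+d} pictures, {videos:+d} videos")
```

Negative numbers mark sections where v0.4 found less than v0.3.5.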

Avnsx commented 1 year ago

What's the creator name, and are you sure that the creator didn't just remove all that other content? Or maybe you were subscribed to the creator before, and during the scraping process with 0.4 you weren't?

swuckt commented 1 year ago

These numbers are taken from the same day (today), maybe 45 minutes apart, so issues with being subscribed or content being removed shouldn't apply. I had this issue the entire month when I was still using the Python source, but was waiting to see if the latest release would fix it.

I can confirm that Message content hasn't been removed. Timeline Previews, I can't honestly confirm for you yet, but I can take the time to do that if you like.

It's mostly the messages that I'm after. v0.4 says it finds 58 message media that can be downloaded. Among them, only 3 out of 6 pictures that were sent to me.

But v0.3.5 detects all 142 message media and is able to get all 6 pictures.

Avnsx commented 1 year ago

I had this issue the entire month when I was still using the Python source, but was waiting to see if the latest release would fix it.

My god you're so annoying, I waited like 2 extra weeks for people to just report me issues like these. Then after I released a compiled executable (which is 10x the effort, without being able to digitally sign an executable file), you come reporting a bug you could've reported like 3 weeks ago when I first released the 0.4 version as python source. I can't believe you expect me to fix a bug I don't even know of and then the audacity you have to even come here and type this issue ticket out is insane.

swuckt commented 1 year ago

My god you're so annoying, I waited like 2 extra weeks for people to just report me issues like these. Then after I released a compiled executable (which is 10x the effort, without being able to digitally sign an executable file), you come reporting a bug you could've reported like 3 weeks ago when I first released the 0.4 version as python source.

Hey, I'm sorry. I wasn't intending to cause you distress. The nuances and difficulties of OSS development are unknown to me.

Clearly, I've messed up somewhere. I hadn't realised you were working on a certain timeline and needed issues reported ASAP. The delay is because I thought the issue was on my end, so I've been trying different machines and configurations infrequently. I didn't want to open up an issue unnecessarily.

I respect your work and commitment to updating this repo beyond what is expected.

I can't believe you expect me to fix a bug I don't even know of and then the audacity you have to even come here and type this issue ticket out is insane.

This is not my expectation at all. I have not made any demands and I am not here demanding a fix. Most of all, I'm absolutely not expecting you to solve all the problems on your own.

I thought this would be a safe place to open up a discussion. So far, I have only provided information.

I was hoping to get some information back.

But I'm feeling a little defensive right now. I'm going to take a deep breath and return when my head is clear.

Avnsx commented 1 year ago

Hey please check if this commit https://github.com/Avnsx/fansly-downloader/commit/115d549204268db8a5bf6d7bd5b425c8add1d71c fixes the issue you mentioned @swuckt

You can run the latest commit by installing the python version of fansly downloader: https://github.com/Avnsx/fansly-downloader#python-version-requirements

Avnsx commented 1 year ago

@swuckt Can you please respond with a simple "yes it fixes the messages bug" or "no, the bug is still not fixed for me", the heart emojis really don't help me fix all possible bugs atm.

swuckt commented 1 year ago

No, the bug is still not fixed for me

Avnsx commented 1 year ago

Okay so after downloading and utilising the latest commit, how did the download numbers change for version 0.4?

What creator are you even using Fansly Downloader on, to validate that there's less content being downloaded?

Can you maybe cut it down to a specific section having less content downloaded or is it just generally downloading less?

Can you provide me examples of media that the download is missing out on?

To help me verify that this is a genuine bug:

Visit the creator page.
Identify posts that appear to have download issues.
If you suspect a post is not being downloaded properly, change the download_mode to Single in the configuration file and attempt to download the content from that post. If the message states "1. duplicate declined," it means you have already downloaded the post previously. If it successfully downloads the content, it indicates that the downloader's functionality is working correctly. The only situation where a bug would be confirmed is if the message states "no scrapable media found" for the post you attempted to download from (and you can visibly see that there's media attached to that post).
Inform me of the post IDs (click on a post to see its post ID in the URL bar) for such posts. If these posts are not restricted to subscribers only, I can verify that there is indeed a bug. If the posts are accessible only to subscribers, I will require access to an account with the affected content in order to confirm that it is not being downloaded and for debugging purposes, so that I can ensure compatibility with the Fansly downloader. Send me an e-mail with account credentials to AvnDev@protonmail.com
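
For anyone who prefers to flip download_mode from a script instead of a text editor, here is a minimal sketch using Python's standard configparser. The thread does not say which section of config.ini holds download_mode, so the helper (a hypothetical name) searches every section rather than guessing one.

```python
import configparser


def set_option_everywhere(path, key, value):
    """Set `key` to `value` in every section of the INI file that defines it.

    Returns the names of the sections that were updated.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original capitalisation of option names
    parser.read(path, encoding="utf-8")

    updated = []
    for section in parser.sections():
        if parser.has_option(section, key):
            parser.set(section, key, value)
            updated.append(section)

    # Rewrite the file; note that configparser drops comments when writing.
    with open(path, "w", encoding="utf-8") as handle:
        parser.write(handle)
    return updated
```

Called as set_option_everywhere("config.ini", "download_mode", "Single"), it returns an empty list if no section defines the option. Because comments in the file are lost on rewrite, keep a backup copy first.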

swuckt commented 1 year ago

Since opening this issue, I've received and sent more pictures and videos.

I tried downloading using your new commit, and compared the contents of the ../Messages/Pictures folders.

The main difference is that earlier images (which were downloaded by v0.4) are no longer there. They've been pushed out by newer images.

Note that "newer" does not mean newer by file name, but newer as in when I received it. Some images have an older file name (i.e. from 2022) but were sent to me this week

This makes me think there is a datetime problem or maybe a limit on how far back you can look into messages for media.

When the program starts, the scrapable media count is already wrong. From my earlier counts above, it's less than half of what it should be.

After I get home, I'm going to add some output lines to see how accessible_media, contained_posts, and parse_media_info() work. I'll have a better idea then.

Avnsx commented 1 year ago

For messages it would be interesting if you had a long enough message history with someone that could verify this thesis: https://github.com/Avnsx/fansly-downloader/blob/2e85993d9ae02c09f1d66f6388b801f296f6b1e0/fansly_downloader.py#L1303-L1305 In version 0.3.5, I would just iterate over each content messages page in steps of 50, but after I realised they removed the max limit integer for limit (sadly only for messages) I just set it to 9999 and expected it to all be downloaded within a single iteration. Which works well for me.

Regarding your thoughts about the datetime might be the problem, here are some possibly influential factors:

In version 0.4 I decided to convert the timezone reported by fansly to the datetime reported by the local system's timezone: https://github.com/Avnsx/fansly-downloader/blob/2e85993d9ae02c09f1d66f6388b801f296f6b1e0/fansly_downloader.py#L407-L429 I am using 24 hour format on my device and I wonder if my code above does properly work for people that don't.
parse_media_info() is an absolute bugfest. I am really bad at efficiently parsing the json API responses in any programming language & on top of that the fansly API is very unhandy and kind of randomly structured for various types of media & has a lot of bugs, which made it even harder for me to properly & efficiently parse what I am looking for in the API responses.
Fansly's API does in general not report the correct timestamps, so I am switching between updatedAt and createdAt multiple times: https://github.com/Avnsx/fansly-downloader/blob/2e85993d9ae02c09f1d66f6388b801f296f6b1e0/fansly_downloader.py#L899-L913 If media reports wrong timestamps, it's most likely because it came from parsing updatedAt. I am doing that additionally because just using createdAt did not manage to provide unique enough filenames, so files would start overwriting each other. Maybe this is still a bug? Check if the output of Fansly Downloader is actually reporting the media IDs that are missing, but within the final download folders that media content does not exist (that would mean it has been overwritten with another file).
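
To illustrate the two concerns above, here is a small sketch, not the code from fansly_downloader.py. It assumes the API hands back a Unix timestamp in seconds. The function names and the "_id_" filename pattern are hypothetical; the date format is copied from the "2023-05-14_at_15-21" style seen in the program's output. strftime's %H is always a 24-hour value regardless of the clock format configured on the device. Putting the media ID into the name keeps two items from the same minute from overwriting each other.

```python
from datetime import datetime, timezone


def format_for_filename(epoch_seconds, tz=None):
    """Format a Unix timestamp like '2023-05-14_at_15-21'.

    tz=None converts to the local system timezone; pass an explicit
    tzinfo (e.g. timezone.utc) for a reproducible result.
    """
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).astimezone(tz)
    # %H is the 24-hour clock; it does not depend on the OS 12/24-hour setting.
    return moment.strftime("%Y-%m-%d_at_%H-%M")


def unique_name(epoch_seconds, media_id, extension, tz=None):
    """Build a filename that stays unique even when timestamps collide."""
    return f"{format_for_filename(epoch_seconds, tz)}_id_{media_id}.{extension}"
```

With names built this way, a wrong or repeated timestamp only affects sorting, not whether a file survives on disk.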

But yes if anything would be bugged out, it would most likely require a fix in parse_media_info() -> parses api responses or sort_download() -> downloads the media based on what parse_media_info() reports. Every section of fansly (Timeline, Messages, Collections etc.) is tunneled through those two functions.

Also I feel like for some reason the 0.4 version behaves differently on everyone's device and I can't figure out why. There's things that just clearly work for me, but don't work for others e.g.: https://github.com/Avnsx/fansly-downloader/discussions/109 & https://github.com/Avnsx/fansly-downloader/issues/105#issuecomment-1594985511, https://github.com/Avnsx/fansly-downloader/issues/101#issuecomment-1589315921

swuckt commented 1 year ago

In version 0.3.5, I would just iterate over each content messages page in steps of 50, but after I realised they removed the max limit integer for limit (sadly only for messages) I just set it to 9999 and expected it to all be downloaded within a single iteration. Which works well for me.

This is probably the cause of my issue. I may be past that limit. I printed out post_object['messages'][-1] to get the oldest message, and it's definitely not the first message I sent or received.

Thanks for sharing that!

https://github.com/Avnsx/fansly-downloader/blob/502e4caeeaee4c7f8fe8a9345b0aba24e4aca431/fansly_scraper.py#L351-L353

Guess I need to add something like this after the first iteration to move the cursor back.
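
A minimal sketch of that idea, not the project's actual code: keep requesting pages and move a cursor to the oldest message of each page until a page comes back empty. The fetch_page callable, its before_id argument and the 50-item cap in the fake server are assumptions for illustration; the real Fansly request parameters are not shown in this thread.

```python
def collect_all_messages(fetch_page, page_size=50):
    """Walk backwards through a conversation one page at a time.

    fetch_page(before_id, limit) must return messages newest-first and only
    those older than before_id (None means "start from the newest").
    """
    collected = []
    before_id = None
    while True:
        page = fetch_page(before_id, page_size)
        if not page:
            break  # an empty page means there is nothing older left
        oldest_id = page[-1]["id"]
        if oldest_id == before_id:
            break  # cursor did not move; stop instead of looping forever
        collected.extend(page)
        before_id = oldest_id  # move the cursor back to the oldest message seen
    return collected


# Fake in-memory "server" holding 120 messages, newest first, that never
# returns more than 50 items per request no matter how large the limit is.
history = [{"id": i} for i in range(120, 0, -1)]


def fake_fetch(before_id, limit):
    older = [m for m in history if before_id is None or m["id"] < before_id]
    return older[:min(limit, 50)]


everything = collect_all_messages(fake_fetch, page_size=9999)
print(len(everything))
```

The loop stops only on an empty page rather than on a short one, so it still reaches the oldest message when the server silently returns fewer items than requested.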

Avnsx commented 1 year ago

Can you in version 0.4 set download_mode to Messages and then after this line: https://github.com/Avnsx/fansly-downloader/blob/2e85993d9ae02c09f1d66f6388b801f296f6b1e0/fansly_downloader.py#L1305

Add:

from pprint import pprint
reachable_media = messages_req.json()['response']['accountMedia']
# pprint(reachable_media, indent=4, width=100)
print('\nRequested url:', messages_req.url)
print('\nTotal length of reachable items:', len(reachable_media))

print('\nThe most distant message in the past:', get_adjusted_datetime(reachable_media[-1]['createdAt']))

print('\nRequest Status Code:', messages_req.status_code)
print('\nResponse Headers:', messages_req.headers)
exit()

save the python file and then run the code with those changes on a creator (you've to change the Username variable in config.ini) who you have the most messages (that contain content and reach far into the past) with.

Then copy paste the output here, letting me know when the first message that contained media content was actually at and what date it said that it would be in the python output.

Also it's important that it says status_code 200 and reports about as much content as you actually have in there.

Furthermore, you can uncomment (remove the #) the # pprint(reachable_media, indent=4, width=100) line and it will actually show you the whole thing that it can parse. The content furthest in the past should be at the very bottom of the Python output and the very latest media content at the very top.

swuckt commented 1 year ago

Requested url: https://apiv3.fansly.com/api/v1/message?groupId=534534320386322432&limit=9999

Total length of reachable items: 62

The most distant message in the past: 2023-05-14_at_15-21

Request Status Code: 200

[...] letting me know when the first message that contained media content was actually at and what date it said that it would be in the python output.

I can't remember the exact date, but transaction history puts it at March 24th 2023, which doesn't match the Python output.

it will actually show you the whole thing that it can parse

If I understand what I'm seeing, is this roughly the same as post_object['accountMedia'][-1]? Except with media in different resolutions.

Avnsx commented 1 year ago

Do you not maybe have a chat history with some random creator, who you happened to follow back in 2021 / 2022 and ever since then the person kept sending you those spammy messages from time to time?

Because what you posted before is not distant enough in the past, and I can't tell if you actually only got 62 media items during that time span or not. And the most distant timestamp thing is not 100% accurate because fansly doesn't report timestamps correctly anyways, so there might be a difference of a couple of months, which is why I need you to try, as explained before, on someone whose chat history dates back to other years.

swuckt commented 1 year ago

save the python file and then run the code with those changes on a creator (you've to change the Username variable in config.ini) who you have the most messages (that contain content and reach far into the past) with.

I think I am not understanding the conditions you want me to test.

Is it more important that there are lots of messages, or that it reaches far back into the past? Or both?

If you are trying to test the 9999 limit, then time wouldn't matter, since, theoretically, it's possible to send 9999 messages within 1 week. Or I could also be misunderstanding what the limit means.

Do you not maybe have a chat history with some random creator, who you happened to follow back in 2021 / 2022 and ever since then the person kept sending you those spammy messages from time to time?

I opened my account March 19 this year.

I can't tell if you actually only got 62 media items during that time span or not.

That sounds correct for the time frame. Note that the media item count includes outgoing pictures and video too. I'm going to browse the response object and cross reference it with known dates, and see if I can verify or disqualify the timestamps.

You should correct my thinking here, but my interpretation of the numbers is as follows:

I run the code and hit the 9999 limit. This means that messages_req.json()['response']['accountMedia'] cannot reach back far enough into the past, and the response only includes past messages starting in May.
As I continue to send messages and receive messages with no media, the Total length of reachable items should decrease
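
That interpretation can be checked with a toy model. The sketch below assumes, purely for illustration, that the server only looks at a fixed number of the most recent messages. Whether the real API behaves this way is not confirmed anywhere in this thread, and the window size used here is arbitrary.

```python
def media_in_window(messages, window):
    """Count media items among the `window` most recent messages.

    `messages` is ordered oldest to newest; each entry is the number of
    media items attached to that message (0 for a text-only message).
    """
    recent = messages[max(0, len(messages) - window):]
    return sum(recent)


# Ten messages with one media item each, and a window of eight messages.
conversation = [1] * 10
before = media_in_window(conversation, window=8)

# Three new text-only messages push three older media messages out of the window.
conversation += [0, 0, 0]
after = media_in_window(conversation, window=8)

print(before, after)
```

Under this assumption the reachable media count drops as text-only messages arrive, which is the behaviour described in the second point.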

I have a busy day, so my next response will be very late.

Avnsx commented 1 year ago

Can you switch to the python source I uploaded into this repository? It basically reverts the messages change i did for 0.4 and replicates how it was in v0.3.5: https://github.com/Avnsx/test-repository/blob/main/fansly_downloader.py

Just press the "Copy raw file" button and paste it into your current python version of 0.4 and let me know if that fixes your issue or not.

Also why did you randomly point out in the first place that the messages change was the cause of your issue, if you initially named the issue ticket "Latest v0.4 release finding less media than v0.3.5" and pasted stats where previews would also download way less?

swuckt commented 1 year ago

... let me know if that fixes your issue or not.

The Messages issue is fixed. Thank you!

It may even be performing better. It's able to find an Audio media that Scraper 0.3.5 couldn't find, even though it existed back when I opened the issue.

(btw you changed the capitalization of the file and your link is giving me a "404 - page not found", but I figured it out)

Also why did you randomly point out in the first place that the messages change was the cause of your issue...

A few reasons:

It seemed the most natural way to do it at the time. If the newest release doesn't match the features of the previous one, that seems to be an appropriate issue to raise.

The fact is, 0.4 finds less media. If it happens for me, it may be an issue for all users. My personal preference for Messages shouldn't stop me from reporting the Timeline Previews issue.

Or do you mean I should separate it into two issues?

That's not to say I don't care about the Timeline Previews issue. But, one thing at a time.

Avnsx commented 1 year ago

Or do you mean I should separate it into two issues?

Yes, that is what I meant.

Ok so scraping from messages is fixed, but there's still less previews being downloaded from timeline?

Can you name a creator for which I can verify & debug this with?

Or better give me the post ids, which contain previews that you could download with 0.3.5, but can't in 0.4

Avnsx commented 1 year ago

You're aware that I can't fix the previews downloading issue if you don't tell me the creator name right? @swuckt

swuckt commented 1 year ago

Thanks for your patience. I had busy work days.

I'll update this post with the other details if possible.

Avnsx commented 1 year ago

User is sexyflo,werwater

Can't replicate any downloading issues, but neither do I think that's the correct username that you were originally complaining about, because in your initial issue message you mentioned completely different numbers for downloaded content in 0.4 vs 0.3.5.

Regardless I couldn't care less if you name me the correct creator name or not, you're the one that is going to not be able to download content afterwards. Other people used the scraper on a bunch of creators too and no one said anything about preview content missing in timeline. Just don't come back to me in a month saying, you knew about this bug before, but magically expected it to get fixed.

Finally I released various commits which introduce a new module called rich (need to install with pip install rich), it is used to display loading bars, specifically on content that is bigger in file size now. Additionally it fixes various bugs.

Would be nice of you, if you downloaded the latest python version and helped me test it.

swuckt commented 1 year ago

I'll have the details later today.

I'd be happy to test it. Is there anything specific you want done?

@Avnsx

That sent me on a journey.

Things I tried:

Downloaded from creator using Scraper v0.3.5 and v0.4. Twice. The second time was from a clean folder after unzipping the v0.3.5 and v0.4 Releases and going through the configuration again.
I wrote out the filenames in each folder to a .txt file, did some cleanup, and compared them in a spreadsheet.
Separated video files by name into Year-Month folders, to make it easier to compare and see what is missing
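
The same comparison can be done without a spreadsheet. This is a rough sketch that assumes every filename starts with a YYYY-MM date prefix, as in the "2023-05-14_at_15-21" format the program prints; the function names are made up. Files are counted per month rather than matched by full name.

```python
import re
from collections import Counter
from pathlib import Path

MONTH_PREFIX = re.compile(r"^(\d{4}-\d{2})")


def count_by_month(folder):
    """Count the files in `folder`, grouped by the YYYY-MM prefix of their names."""
    counts = Counter()
    for path in Path(folder).iterdir():
        if path.is_file():
            match = MONTH_PREFIX.match(path.name)
            counts[match.group(1) if match else "unknown"] += 1
    return counts


def compare_months(old_folder, new_folder):
    """Return {month: (old_count, new_count)} for months where the counts differ."""
    old = count_by_month(old_folder)
    new = count_by_month(new_folder)
    return {
        month: (old[month], new[month])
        for month in sorted(set(old) | set(new))
        if old[month] != new[month]
    }
```

Because the grouping relies on the date in the filename, files with a wrong datestamp will land in the wrong month, so a mismatch can mean either missing files or misdated ones.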

Link to spreadsheet: https://docs.google.com/spreadsheets/d/1fMRbrjwhKNKQJypq0h1PbGkbc7C3VP1TxKIZuHLmea8/edit?usp=sharing

From the spreadsheet, and from browsing the downloaded files by hand, here are my findings about v0.4:

Downloader v0.4 is not finding any previews (both pictures and videos)
Several datestamps are very wrong, but tend to "bunch up" on the same wrong date
Creator's content only goes back to 2022-05, but filenames contain dates like 2022-04 and 2022-03. In contrast, Scraper v0.3.5 usually has the correct datestamps
The most obvious naming errors occur between 2022-05 and 2022-08
between dates 2022-05 and 2022-08, non-matching amounts of videos (v0.3.5 downloads 138 files, v0.4 downloads 80 files)
between dates 2022-09 and 2023-06, matching amounts of videos (100 files), and matching content (verified by eye)

Unrelated findings/Separate Issues:

Individual picture and video IDs (in the filename) don't match any of the individual IDs from v0.3.5 (is that supposed to happen?)
A handful of newer videos downloaded with v0.4 are not hashed (see spreadsheet > TimelineVideos, column L)
Still have the issue where .jpegs are downloaded as .pngs

Next steps:

I manually compiled all the Post Links / Post IDs between 2022-05 and 2022-08. Presumably, 58 (or 59) of those don't work. Have to find out which ones, but have to save it for another day. Post IDs available in the spreadsheet
Perhaps finding less previews has something to do with the change from #80

Avnsx commented 1 year ago

Please tell me you used the python 0.4.1 version by downloading it as a zip from the repository and not the 0.4 version linked within the releases page. The 0.4 version was outdated at this point, I had fixed a similar issue where some post content would not be downloaded.

The dates being wrong is also whatever, I don't really care. But I do believe you when you say that 0.3.5 had more accurate dates, as I was mentioning before I am not exactly the best at parsing json responses.

A handful of newer videos downloaded with v0.4 are not hashed (see spreadsheet > TimelineVideos, column L)

That's normal, .m3u8 videos do not get hashed as they're being downloaded, due to the way .m3u8 videos are structured (they come in ts chunks and fansly downloader utilises the local GPU to transcode & merge them together into an actual .mp4 video). They do get hashed & the hashes are appended to filenames, the second time you start fansly downloader to update a previous download folder.

Still have the issue where .jpegs are downloaded as .pngs

This is the only interesting issue to me, because that should've been fixed in commit: https://github.com/Avnsx/fansly-downloader/commit/6b0c8d56f54145ea87002ea15f506ca933660d1d Can you point out a post ID which still has images downloading as videos?

Finally I'm closing this issue ticket, because as far as I comprehend the situation;

I fixed your initial issue with messages not being downloaded
You're the only one that still has this timeline previews not downloading issue
All the post ids you linked are locked behind a paywall, I still can't tell if they're actually not being downloaded and even if they weren't being downloaded, I still can't debug it because I refuse to pay for that. Which is why like 50 messages before I wrote a good step by step guide that could've saved you and me tons of time

Avnsx / fansly-downloader

Messages content not downloading in v0.4 #104
