pathFollowers need to reference multiple jsons

thejasonhsu commented 1 year ago

If your follower list is huge, you'll receive followers_1.json, followers_2.json, etc. I'm a Go noob so my current workaround is manually combining _2.json, _3.json, etc. into _1.json.

cecobask commented 1 year ago

Hi @thejasonhsu, thank you for reporting this issue! Can you confirm the maximum number of objects in the first JSON file? In other words, at what point are subsequent files generated for your followers data?

I can update the code to account for that in the coming week, but feel free to open a pull request if you’re interested.

thejasonhsu commented 1 year ago

Np. Won't be submitting PR since I can't fix it via Go haha.

followers_1.json ends on line 130002. Take away the [ and ] lines and we have 130k lines. Each account entry takes up 13 lines, so followers_1.json ends after exactly 10k entries. The 10001st follower shows up as the 1st entry in followers_2.json.
No need to account for this in following.json because following caps at 7.5k.
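The arithmetic behind the 10k figure can be checked with a small Go sketch. The function name `entriesInFile` is hypothetical, and it assumes the file is pretty-printed with one `[` line, one `]` line, and a fixed number of lines per entry:

```go
package main

import "fmt"

// entriesInFile estimates how many account entries a pretty-printed
// followers file holds, given its total line count and the number of
// lines per entry. The two bracket lines ("[" and "]") are removed first.
func entriesInFile(totalLines, linesPerEntry int) int {
	return (totalLines - 2) / linesPerEntry
}

func main() {
	// 130002 lines in followers_1.json, 13 lines per account entry.
	fmt.Println(entriesInFile(130002, 13)) // prints 10000
}
```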

cecobask commented 1 year ago

Hi @thejasonhsu, I've addressed the issue in the latest commit. I tested this locally with some dummy data - it should work fine for you too. Could you please sync the latest changes to your fork and confirm whether it works as expected?

thejasonhsu commented 1 year ago

It still only reads the first json.

cecobask commented 1 year ago

The application extracts the zip file that you provide to a folder called instagram_data. Afterward, it assumes the following structure:

$ tree instagram_data
instagram_data
└── followers_and_following
    ├── followers_1.json
    ├── followers_2.json
    └── following.json

There can be any number of followers_n.json files in your followers_and_following folder. Here are the contents of my dummy files:

$ cat instagram_data/followers_and_following/followers_1.json
[
  {
    "title": "",
    "media_list_data": [

    ],
    "string_list_data": [
      {
        "href": "https://www.instagram.com/test1",
        "value": "test1",
        "timestamp": 1664195508
      }
    ]
  }
]

$ cat instagram_data/followers_and_following/followers_2.json
[
  {
    "title": "",
    "media_list_data": [

    ],
    "string_list_data": [
      {
        "href": "https://www.instagram.com/test2",
        "value": "test2",
        "timestamp": 1663444411
      }
    ]
  }
]

The application adds all unique elements from the followers_n.json files to a list and the final result is the following:

test1
test2

The number of elements in the JSON arrays shouldn't matter: the application works the same way regardless of the array sizes. Let me know if you think the outcome should be different. It would be helpful if you could confirm that the folder structure is the same for you.
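For illustration, here is a minimal Go sketch of the merge described above. It is not the code from the commit: the type `entry` and the function `mergeUsernames` are hypothetical names, and it assumes only the `value` field of each `string_list_data` item matters.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// entry mirrors the part of one followers_n.json object that is needed here.
type entry struct {
	StringListData []struct {
		Value string `json:"value"`
	} `json:"string_list_data"`
}

// mergeUsernames collects the unique "value" fields from the raw contents
// of several followers_n.json files, keeping first-seen order.
func mergeUsernames(files [][]byte) ([]string, error) {
	seen := make(map[string]bool)
	var usernames []string
	for i, raw := range files {
		var entries []entry
		if err := json.Unmarshal(raw, &entries); err != nil {
			return nil, fmt.Errorf("file %d: %w", i+1, err)
		}
		for _, e := range entries {
			for _, s := range e.StringListData {
				if !seen[s.Value] {
					seen[s.Value] = true
					usernames = append(usernames, s.Value)
				}
			}
		}
	}
	return usernames, nil
}

func main() {
	// Folder layout taken from the tree output above.
	pattern := filepath.Join("instagram_data", "followers_and_following", "followers_*.json")
	paths, err := filepath.Glob(pattern)
	if err != nil {
		log.Fatal(err)
	}
	var files [][]byte
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			log.Fatal(err)
		}
		files = append(files, raw)
	}
	usernames, err := mergeUsernames(files)
	if err != nil {
		log.Fatal(err)
	}
	for _, u := range usernames {
		fmt.Println(u)
	}
}
```

Because duplicates are tracked in a map, neither the array sizes nor the number of matched files changes the logic.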

thejasonhsu commented 1 year ago

Your assumption and design are correct. I can confirm that my folder structure follows the design.

$ tree instagram_data
instagram_data
└── followers_and_following
    ├── followers_1.json
    ├── followers_2.json
    └── following.json

Right now it only reads followers_1.json.

First run before e3e301c - failed. Only followers_1.json was used to compare with following.json. Expected results.
Second run before e3e301c - success. The workaround was manually combining followers_1.json with followers_2.json into just followers_1.json. Expected results.
Latest run after e3e301c - failed. Since the code should now combine any number of followers_#.json files into followers_n.json, I split my merged followers_1.json back into the original followers_1.json and followers_2.json. Unexpected results.

My assumption is that followers_n.json currently contains only the data from followers_1.json. Is there a way to read or debug this file during or after the workflow?
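One way to check which files are picked up, independent of the application itself, is a small standalone Go program like the sketch below. It is not part of instagram-insights. The function `countEntries` is a hypothetical helper, and the sketch assumes the extracted instagram_data folder is still on disk after the run.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// countEntries returns the number of top-level objects in one followers file.
func countEntries(raw []byte) (int, error) {
	var entries []json.RawMessage
	if err := json.Unmarshal(raw, &entries); err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	// Folder layout taken from the tree output above.
	pattern := filepath.Join("instagram_data", "followers_and_following", "followers_*.json")
	paths, err := filepath.Glob(pattern)
	if err != nil {
		log.Fatal(err)
	}
	total := 0
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			log.Fatal(err)
		}
		n, err := countEntries(raw)
		if err != nil {
			log.Fatalf("%s: %v", p, err)
		}
		fmt.Printf("%s: %d entries\n", p, n)
		total += n
	}
	fmt.Printf("%d files, %d entries in total\n", len(paths), total)
}
```

If only followers_1.json is listed, the glob pattern or folder location is the likely culprit. If both files are listed with the expected counts, the problem is more likely in how the entries are merged or compared.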

cecobask / instagram-insights

pathFollowers need to reference multiple jsons #1