Erotemic / shitspotter

An open source algorithm and dataset for finding poop in pictures.

Doing God's Work - Download via IPFS may not be working as intended? #17

Open njho opened 9 months ago

njho commented 9 months ago

This is awesome. I'm surprised that there wasn't already a shit dataset.

I'm in the mountains driving, so don't anticipate I can download the dataset anytime soon. Curious though, are the poos labeled?

Love it. I've told my girlfriend that this person's spent the last x amount of time taking pictures of poos - she'll never understand. In all seriousness though, the poos are labelled/annotated?

Erotemic commented 9 months ago

Yes, I've completed polygon annotations on all available data so far using labelme. The SAM AI assistant makes the process pretty quick, so I'll likely be able to keep up with it as I push out new data each month.

The labels are stored both in sidecar labelme JSON files, and are aggregated into the main kwcoco file whenever the dataset is updated. Currently there are 3609 polygons across train and validation over 1964 images.
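For reference, counting polygons across labelme sidecar files is straightforward to do yourself. This is a minimal sketch, assuming the standard labelme JSON layout (a top-level `"shapes"` list with a `"shape_type"` field per annotation); the directory name is hypothetical:

```python
import json
from pathlib import Path

def count_polygons(labelme_dir):
    """Count polygon shapes across labelme sidecar JSON files."""
    total = 0
    for path in sorted(Path(labelme_dir).glob("*.json")):
        data = json.loads(path.read_text())
        total += sum(1 for shape in data.get("shapes", [])
                     if shape.get("shape_type") == "polygon")
    return total
```

The kwcoco file aggregates the same information, so on a synced dataset the two counts should agree.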

I'm not sure how accessible the data is via IPFS (part of this project is to determine how feasible IPFS is for managing and distributing datasets). One person has reported they've been unable to grab it, but I'm not sure if the error is on my side or theirs.

njho commented 9 months ago

Okay, let me take a look. I was travelling.

I'm unfamiliar with IPFS, but I'll report back with the results regardless!

Great work! Also checking out your profile. May reach out. You seem to have some interesting hobbies :)

njho commented 9 months ago

Hi!

So for an update, downloading via IPFS doesn't really work that well.

Debugging Process:

In short, I think that it's somewhat untenable to download via IPFS. Is there a way to download otherwise?

I'll leave it running overnight in hopes that it's able to resolve over time.

njho commented 9 months ago

I can edit the issue name as well, so it's more relevant

njho commented 9 months ago

No progress from running overnight. Any chance you'd be able to upload and host elsewhere?

Erotemic commented 9 months ago

Thank you for attempting to access the data, there is a discussion on the IPFS forums where I'm attempting to learn more and ideally get this working well: https://discuss.ipfs.tech/t/feasibility-for-self-hosting-scientific-datasets/17355

I want to make a push to figure out if IPFS can work for this. From everything I understand it should, but there may be a setup misconfiguration on my end.

IPFS has a nice property: if you upload dataset A, and then upload dataset B, which is a superset of A, then when you download B it automatically recognizes the overlap and only processes the new data. That is ideal for distributing living datasets like this. If I hosted via traditional means, I would either need to upload the dataset piecewise and have the user reconstruct the entire thing on their end, or I would upload duplicate versions of the data. The other reason for IPFS is that it's free to use and (if it ends up working) it democratizes access to the data and allows others to seamlessly rehost.
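The dedup property comes from content addressing: data is split into blocks, each block is named by its hash, and a superset dataset shares block IDs with the original, so only unseen blocks transfer. A toy sketch of the idea (tiny chunk size for illustration; real IPFS uses much larger blocks and a Merkle DAG, not a flat list):

```python
import hashlib

def chunk_ids(data, chunk_size=4):
    """Content-address fixed-size chunks by their hash (toy model of IPFS blocks)."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

dataset_a = chunk_ids(b"AAAABBBB")            # original upload
dataset_b = chunk_ids(b"AAAABBBBCCCC")        # superset: A plus new data
new_blocks = set(dataset_b) - set(dataset_a)  # only these need to transfer
```

Here `new_blocks` contains a single entry, the hash of the appended chunk, which is why monthly dataset updates only cost the size of the new images.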

As I'm typing this I checked my IPFS server, and it seems to be offline. I posted another issue with details here: https://discuss.ipfs.tech/t/error-hosting-data-on-rasberry-pi/17593

I also restarted the service so maybe it will work now? (Or maybe it will cause another OOM on my pi).

It also might be worth checking the original dataset upload: QmNj2MbeL183GtPoGkFv569vMY8nupUVGEVvvvqhjoAATG, which I believe is repinned by more than just me due to this reddit thread: https://www.reddit.com/r/DataHoarder/comments/rxtr65/ipfs_for_a_shitty_cause/

njho commented 9 months ago

Okay, so yeah, tried a couple things

CID: bafybeibxxrs3w7iquirv262ctgcwgppgvaglgtvcabb76qt5iwqgwuzgv4 (latest)

CID: QmNj2MbeL183GtPoGkFv569vMY8nupUVGEVvvvqhjoAATG

Maybe it is that no one else has fully synced it and your RPi is the only source? I see it on the Cloudflare gateway though... Would they not also serve it? 🤔

Erotemic commented 9 months ago

My node still looks like it is online, I'm a little sad because I was hoping it was a configuration thing where I set some resource limit too high. Hmm... hopefully someone on the ipfs forums can help.

Clearly my node was working at some point, or it still is and for whatever reason the new CID doesn't work well. It could also be the case that someone else served you the old dataset and my node or network configuration is wrong. I'm really stumped: I've been debugging this on and off for 2 months now, and I still don't feel closer to the answer.

I haven't used IPFS desktop, but once you have the files on your local node, it's very quick to access the data directly on IPFS, or you can copy it to your regular filesystem. (It would be nice to have a way to avoid the data-duplication cost of accessing the data while also pinning it on IPFS - I'm wondering how feasible an IPFS FUSE system would be.)

I appreciate you working with me on this IPFS experiment. If you want to work with the data sooner rather than later, I can host a snapshot of the JSON annotations on a centralized service, and perhaps I could do that with the rest of the dataset too. If it's not urgent, then I do plan on banging my head against this until it works or I have a better idea.

EDIT: Also, try one more time. I also have the data on a Kitware IPFS node, which I just saw was offline; perhaps it's reachable from there (they have a much better upload speed than my home network connection). I may have that node misconfigured too, but it's a data point.

EDIT Again: After thinking about it, I'm interested in the question of whether IPFS is feasible / effective for distribution of scientific data. That means I should make the entire dataset available via a centralized source, so that's what I'm doing: https://data.kitware.com/#user/598a19658d777f7d33e9c18b/folder/65d6c52fb40ab0fa6c57909b

About 1/30 of the way through the upload, so it should be ready soon. Because the dataset is over 30GB, I've split it up across multiple zipfiles, but everything is in those, including the labelme annotations and the top-level kwcoco manifests / annotations.

I also have trained models that I will make available. They are not doing great, but they aren't completely terrible.

Currently getting loss curves like the one below, so I probably need to keep tuning hyperparameters. I don't like that spike.

[val_loss training curve]

Qualitative validation heatmaps:

It gets a lot of easy cases, and has hints of getting it in difficult cases.


njho commented 9 months ago

Something's definitely changed!

I left my Windows machine at work, so I'm using the CLI while IPFS desktop loads the older dataset:

```
ipfs get bafybeibxxrs3w7iquirv262ctgcwgppgvaglgtvcabb76qt5iwqgwuzgv4
```

njho commented 9 months ago

Interesting, yeah, I'm thinking of trying to train a yolov8 model on it and seeing what comes out. What architecture are you looking at for that? It's not a segmentation model but something else?

Thanks! Happy to keep on debugging this with you anyhow. And yeah taking a look through some of these samples - lots of very hard to spot poos

njho commented 9 months ago

As for the spikes, they should be okay, no? Might just be finding a new local minimum? Fun to talk about this w/ other people :)

Erotemic commented 9 months ago

Spikes could be ok, or they could indicate numerical instability. I would ultimately like to track more metrics, like effective rank, percent-dead-units, gradient magnitude, and weight magnitude.
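Of those metrics, effective rank is the least standard, so here's a minimal sketch of one common definition (the entropy-based effective rank of a matrix's singular value spectrum, due to Roy and Vetterli); the function name is my own, and it takes precomputed singular values rather than a weight matrix:

```python
import math

def effective_rank(singular_values):
    """exp(Shannon entropy of the normalized singular value distribution).

    Equals the true rank when all nonzero singular values are equal,
    and degrades toward 1 as the spectrum concentrates on one direction.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

effective_rank([1.0, 1.0, 1.0, 1.0])  # -> 4.0 (uniform spectrum, full rank)
effective_rank([10.0, 0.1, 0.1])      # -> close to 1 (one dominant direction)
```

In a training loop you'd feed this the singular values of each layer's weight matrix (e.g. from an SVD) and log it alongside the loss, so a collapsing spectrum shows up before the loss curve spikes.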

The model is a semi-custom split-attention transformer (this split attention doesn't really mean anything in this case with a single modality and no time component), so it's just a semi-lightweight 4M param transformer. It has some bells and whistles to support remote sensing data, but it works just as well in the simpler normal-sized-image rgb-only case.

The training framework is geowatch: https://gitlab.kitware.com/computer-vision/geowatch The idea is you should basically be able to

```
pip install geowatch[headless]
geowatch finish_install  # gets some of the weirder dependencies like GDAL set up
```

and then use commands like `python -m geowatch.tasks.fusion.train --help` and `python -m geowatch.tasks.fusion.predict --help`. It uses pytorch-lightning on the backend, and it shouldn't be difficult to get a yolov8 model training if you can write a wrapper around it to accept the style of batches produced by the KWCocoVideoDataloader. The README in the above link goes over this information, and there is a lot more in the docs.

EDIT: pretrained model is live: https://data.kitware.com/#user/598a19658d777f7d33e9c18b/folder/65d6ca07b40ab0fa6c5790a8

The technical overview is a good place to start.

I have tutorials that can be run end-to-end (generate toydata, train model, predict model) here: https://gitlab.kitware.com/computer-vision/geowatch/-/tree/main/docs/source/manual/tutorial?ref_type=heads Tutorial 1 is the most relevant for RGB phone images.

Invocations for my latest training runs are stored in: https://github.com/Erotemic/shitspotter/blob/main/train.sh

njho commented 9 months ago

Awesome, thank you for providing so many resources! My knowledge is definitely more surface level but I love learning so I can already tell I'll probably be spending a few days reading through this 😉 Thanks again

As well, download's still progressing 19.8% 👊

Erotemic commented 9 months ago

I think my node is online. I set up an entire new node from scratch and replaced the old one; I think something was wonky with the machine. I'm going to reformat it and repurpose it for something else. I also wonder if trying to use the accelerated DHT was causing issues. I made sure the new node was set up with minimal changes and the "lowpower" profile, but I did have to manually enter IPv4 addresses into Addresses.AppendAnnounce. In any case, it seems to be passing checks on fleek, so maybe it's working?

Notes on the one extra config step I took:

```
IPFS_PORT=4001
WAN_IP_ADDRESS=$(curl ifconfig.me)
echo "WAN_IP_ADDRESS = $WAN_IP_ADDRESS"
echo "[
    \"/ip4/${WAN_IP_ADDRESS}/tcp/${IPFS_PORT}\",
    \"/ip4/${WAN_IP_ADDRESS}/udp/${IPFS_PORT}/quic\",
    \"/ip4/${WAN_IP_ADDRESS}/udp/${IPFS_PORT}/quic-v1\",
    \"/ip4/${WAN_IP_ADDRESS}/udp/${IPFS_PORT}/quic-v1/webtransport\",
]"
# Manually add the above lines to Addresses.AppendAnnounce
ipfs config edit
ipfs config --json Addresses.AppendAnnounce
```

njho commented 9 months ago

Yep I got full downloads of both the old and new datasets!

Just 'cause I don't really have a "node" set up yet, I'm not serving them to other peers yet, but I'll be setting up a node shortly here just to help!

njho commented 8 months ago

Ordering a new Raspberry Pi -> bricked my other one plugging into GPIO while it was on 🙄. Will be hosting thereafter.

njho commented 8 months ago

Hey, trying to sync mine with the latest CID. Can you do the same check to see if your node is up so I can sync? It's hanging again using `ipfs get bafybeia2gphecs3pbrccwopg63aka7lxy5vj6btcwyazf47q6jlqjgagru`

Erotemic commented 8 months ago

I'm curious what happens if you try to do a pin instead of a get. The following command should show some amount of progress, even if it is slow:

```
ipfs pin add --progress --name shitspotter-2024-02-29 -- bafybeia2gphecs3pbrccwopg63aka7lxy5vj6btcwyazf47q6jlqjgagru
```

The dataset is starting to get non-trivially large, and it seems IPFS does want to trace the entire tree even if a lot of it already exists on the machine (although duplicate downloads are still avoided).
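That traversal cost is easy to picture with a toy model: the CID is the root of a Merkle DAG, and a sync has to walk every link to learn what the tree contains, even though blocks already stored locally are not re-downloaded. A sketch (dict-based stand-in, not the real bitswap protocol):

```python
def sync_dag(cid, dag, local, fetched):
    """Toy Merkle-DAG sync: the whole tree is traversed, but blocks
    already present in `local` are not fetched again."""
    if cid not in local:
        local[cid] = dag[cid]   # simulated network fetch
        fetched.append(cid)
    for child in local[cid]["links"]:
        sync_dag(child, dag, local, fetched)

# dataset v2 shares the 'old' subtree with a previously pinned version
dag = {
    "root_v2": {"links": ["old", "new"]},
    "old": {"links": []},
    "new": {"links": []},
}
local = {"old": dag["old"]}     # already on disk from the last pin
fetched = []
sync_dag("root_v2", dag, local, fetched)
# fetched == ["root_v2", "new"]: 'old' was visited but not re-downloaded
```

So as the dataset grows, each update still pays the metadata walk over the full tree, while the data transfer stays proportional to what's new.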

If it still hangs at 0/0 for a long time let me know. I currently only have it pinned on my pi. I'm curious about its external visibility. According to ipfs-check and pl-diagnose on fleek it seems ok:

https://ipfs-check.on.fleek.co/?cid=bafybeia2gphecs3pbrccwopg63aka7lxy5vj6btcwyazf47q6jlqjgagru&multiaddr=%2Fp2p%2F12D3KooWCFcfiBevjQD42aRAELMUZXAGScRiN2NcAthokF4dMnVU

https://pl-diagnose.on.fleek.co/#/diagnose/access-content?addr=%2Fip4%2F172.100.113.212%2Ftcp%2F4001%2Fp2p%2F12D3KooWCFcfiBevjQD42aRAELMUZXAGScRiN2NcAthokF4dMnVU&backend=https%3A%2F%2Fpl-diagnose.onrender.com&cid=bafybeia2gphecs3pbrccwopg63aka7lxy5vj6btcwyazf47q6jlqjgagru

But if you aren't able to reach it, I can try pinning on another server that is more available.

njho commented 8 months ago

For whatever reason, IPFS pin works significantly better. Syncing now. This node should be online all the time.

Will close the issue when it's completely synced. Thanks for the suggestion.

Why does pin work, but not get? Just at work and don't have enough time to Google.

Erotemic commented 8 months ago

I have no idea. When I tried, get worked for me, with immediate feedback and progress. My node has been online the entire time: it has an uptime of 19 days and there are no errors in the ipfs logs.

In any case, pinning will make subsequent updates much faster, although you will still have to do a "get" after the "pin" finishes to access the data on your filesystem. However, that will be much faster, as it constructs the files from the blocks already in your local IPFS pin.