coreweave / dataset-downloader

MIT License

Issue downloading datasets #2

Closed (parallelo closed this issue 1 year ago)

parallelo commented 1 year ago

Hi! I'm taking a look at CoreWeave's LLM documentation, and I'm trying to get GPT-J fine-tuned with the example dataset.

There seems to be an issue with the dataset-downloader -- my PVC does not get populated with the dataset.

I'm currently trying to determine whether the problem is the path being passed to it or the internals of main.go. Debugging is a bit more involved because the dataset-downloader runs in a GitHub-produced distroless container, so inspecting its state at runtime takes some extra effort.
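
One way to separate the two possibilities above without a shell in the container is to probe the data directory from Go itself. The following is only a minimal sketch under stated assumptions: `checkWritable` is a hypothetical helper that is not part of the original main.go, and the default directory is a temporary one so the sketch runs anywhere. Pass the PVC mount path as the first argument to test the real location.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// checkWritable is a hypothetical helper (not from the original main.go).
// It creates dir if needed, then writes and removes a small probe file,
// returning an error describing the first step that fails.
func checkWritable(dir string) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return fmt.Errorf("create %s: %w", dir, err)
	}
	probe := filepath.Join(dir, ".write-probe")
	if err := os.WriteFile(probe, []byte("ok"), 0o644); err != nil {
		return fmt.Errorf("write probe in %s: %w", dir, err)
	}
	if err := os.Remove(probe); err != nil {
		return fmt.Errorf("remove probe in %s: %w", dir, err)
	}
	return nil
}

func main() {
	// Assumed default for illustration only; override with the PVC path.
	dir := filepath.Join(os.TempDir(), "dataset-probe")
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	if err := checkWritable(dir); err != nil {
		log.Fatalf("data dir is not writable: %v", err)
	}
	log.Printf("data dir %s is writable", dir)
}
```

If the probe succeeds on the PVC path but no files appear, the path is less likely to be the culprit and the download logic itself becomes the next thing to inspect.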

My K8s logs for the relevant pod only show this:

2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/140
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/0
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/20
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/40
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/60
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/80
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/100
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/120

I'd expect the logs to also show something like the following (from this Printf statement):

Downloaded XYZ to /data/finetune-data/dataset/xyz

Anyway, I'm still debugging this, but just wanted to reach out and see if you have suggestions in the meantime. Thanks in advance!

parallelo commented 1 year ago

Side note: I've been working to debug the distroless container using an ephemeral container.

However, it looks like my user on CoreWeave's Managed K8s isn't permitted to create ephemeral containers:

$ kubectl run debug-download --image=ghcr.io/coreweave/dataset-downloader/smashwords-downloader:836cea3 --restart=Never -- sleep 1d
pod/debug-download created

$ kubectl debug -it debug-download --image=busybox:1.28 --target=debug-download
Targeting container "debug-download". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to [SNIP].
Error from server (Forbidden): pods "debug-download" is forbidden: User "[SNIP]" cannot patch resource "pods/ephemeralcontainers" in API group "" in the namespace "[SNIP]"

parallelo commented 1 year ago

Had some time to dig further -- just doing some note-taking here.

Let's simplify the setup by removing Kubernetes as an unknown. Using a standalone golang:latest Docker container as a simpler test, here's the expected output:

$ go run main.go --data_dir=/mnt/pvc/dataset
2023/01/19 01:45:48 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/140
2023/01/19 01:45:48 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/100
2023/01/19 01:45:48 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/20
2023/01/19 01:45:48 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/0
2023/01/19 01:45:48 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/120
2023/01/19 01:45:48 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/80
2023/01/19 01:45:48 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/60
2023/01/19 01:45:48 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/40
2023/01/19 01:45:48 Downloaded Mail Order Bride: Blinded By Love (Brides Of The West: Book 1) to /mnt/pvc/dataset/MailOrderBrideBlindedByLoveBridesOfTheWestBook1
2023/01/19 01:45:48 Downloaded Jackson: The Sons of Dusty Walker to /mnt/pvc/dataset/JacksonTheSonsofDustyWalker
2023/01/19 01:45:48 Downloaded The Treasure Bride to /mnt/pvc/dataset/TheTreasureBride
2023/01/19 01:45:48 Downloaded Mail Order Bride Margaret (Montana Destiny Brides, Book 1) to /mnt/pvc/dataset/MailOrderBrideMargaretMontanaDestinyBridesBook1
2023/01/19 01:45:48 Downloaded Stage West - David to /mnt/pvc/dataset/StageWestDavid
2023/01/19 01:45:48 Downloaded Cowboy Paradise to /mnt/pvc/dataset/CowboyParadise
2023/01/19 01:45:48 Downloaded Learning To Love (Carson Hill Ranch: Book 1) to /mnt/pvc/dataset/LearningToLoveCarsonHillRanchBook1
2023/01/19 01:45:48 Downloaded Ty Hard to /mnt/pvc/dataset/TyHard
2023/01/19 01:45:48 Downloaded Historical Cowboy Romance Two Book Box Set - Mail Order Brides to /mnt/pvc/dataset/HistoricalCowboyRomanceTwoBookBoxSetMailOrderBrides
2023/01/19 01:45:48 Downloaded Maralie to /mnt/pvc/dataset/Maralie
2023/01/19 01:45:48 Downloaded December Love to /mnt/pvc/dataset/DecemberLove
2023/01/19 01:45:48 Downloaded A Cowboy Sunrise to /mnt/pvc/dataset/ACowboySunrise
2023/01/19 01:45:48 Downloaded No More Tears to /mnt/pvc/dataset/NoMoreTears
2023/01/19 01:45:48 Downloaded Lost in Texas to /mnt/pvc/dataset/LostinTexas
2023/01/19 01:45:48 Downloaded Blame it on Texas to /mnt/pvc/dataset/BlameitonTexas
2023/01/19 01:45:48 Downloaded Revenge Requires Two Graves to /mnt/pvc/dataset/RevengeRequiresTwoGraves
2023/01/19 01:45:48 Downloaded Mending Fences (Texas Heat: Book 1) to /mnt/pvc/dataset/MendingFencesTexasHeatBook1
2023/01/19 01:45:48 Downloaded 3 Book Romance Bundle: "Love in the Jungle" & "Falling for the Bull Rider" & "Flown by the Billionaire" to /mnt/pvc/dataset/3BookRomanceBundleLoveintheJungleFallingfortheBullRiderFlownbytheBillionaire
2023/01/19 01:45:48 Downloaded Maddy's Oasis to /mnt/pvc/dataset/MaddysOasis
2023/01/19 01:45:49 Downloaded An Unexpected Widow (The Colorado Brides Series) to /mnt/pvc/dataset/AnUnexpectedWidowTheColoradoBridesSeries
2023/01/19 01:45:49 Downloaded Stage West - Dalton to /mnt/pvc/dataset/StageWestDalton
2023/01/19 01:45:49 Downloaded Ridge to /mnt/pvc/dataset/Ridge
2023/01/19 01:45:49 Downloaded Mail Order Bride: The Irish Runaway to /mnt/pvc/dataset/MailOrderBrideTheIrishRunaway
2023/01/19 01:45:49 Downloaded Mail Order Bride: Westward Winds (Montana Mail Order Brides: Book 1) to /mnt/pvc/dataset/MailOrderBrideWestwardWindsMontanaMailOrderBridesBook1
2023/01/19 01:45:49 Downloaded Texas Rose to /mnt/pvc/dataset/TexasRose
2023/01/19 01:45:49 Downloaded Forever His (A 30 Book Steamy Contemporary Romance Bundle) to /mnt/pvc/dataset/ForeverHisA30BookSteamyContemporaryRomanceBundle
2023/01/19 01:45:49 Downloaded Silver Heart (Historical Western Romance) to /mnt/pvc/dataset/SilverHeartHistoricalWesternRomance
2023/01/19 01:45:49 Downloaded Cowboys for Christmas to /mnt/pvc/dataset/CowboysforChristmas
2023/01/19 01:45:49 Downloaded Dream Kisses to /mnt/pvc/dataset/DreamKisses
2023/01/19 01:45:49 Downloaded Abby: Mail Order Bride (Unconventional Series #1) to /mnt/pvc/dataset/AbbyMailOrderBrideUnconventionalSeries1
2023/01/19 01:45:49 Downloaded Make Mine a Cowboy to /mnt/pvc/dataset/MakeMineaCowboy
2023/01/19 01:45:49 Downloaded Chase and Seduction to /mnt/pvc/dataset/ChaseandSeduction
2023/01/19 01:45:49 Downloaded Stage West - Lindsay to /mnt/pvc/dataset/StageWestLindsay
2023/01/19 01:45:49 Downloaded Violet's Mail Order Husband (Montana Brides #1) to /mnt/pvc/dataset/VioletsMailOrderHusbandMontanaBrides1
2023/01/19 01:45:49 Downloaded Texas Tornado to /mnt/pvc/dataset/TexasTornado
2023/01/19 01:45:49 Downloaded 3 Book Romance Bundle: "Her Last Love Affair" & "Loving Him Peacefully" & "Unwelcome Reunion" to /mnt/pvc/dataset/3BookRomanceBundleHerLastLoveAffairLovingHimPeacefullyUnwelcomeReunion
2023/01/19 01:45:49 Downloaded Hot in the Saddle to /mnt/pvc/dataset/HotintheSaddle
2023/01/19 01:45:49 Downloaded The Callahans (Prequel - Tempted By A Texan Series) to /mnt/pvc/dataset/TheCallahansPrequelTemptedByATexanSeries
2023/01/19 01:45:49 Downloaded Love on the Ranch to /mnt/pvc/dataset/LoveontheRanch
2023/01/19 01:45:49 Downloaded Western Romance: Cowboy Romance: Sally and Evan: Clean Slate (Western Historical Short Story Romance) to /mnt/pvc/dataset/WesternRomanceCowboyRomanceSallyandEvanCleanSlateWesternHistoricalShortStoryRomance
2023/01/19 01:45:49 Downloaded Kate to /mnt/pvc/dataset/Kate
2023/01/19 01:45:49 Downloaded At the Cowboy's Mercy to /mnt/pvc/dataset/AttheCowboysMercy
2023/01/19 01:45:49 Downloaded Romance on the Ranch to /mnt/pvc/dataset/RomanceontheRanch
2023/01/19 01:45:49 Downloaded Big Sky Blue to /mnt/pvc/dataset/BigSkyBlue
2023/01/19 01:45:49 Downloaded 3 Book Romance Bundle: "Loving The Bull Rider" & "Cowboy Down Under" & "The Escort Next Door" to /mnt/pvc/dataset/3BookRomanceBundleLovingTheBullRiderCowboyDownUnderTheEscortNextDoor
2023/01/19 01:45:49 Downloaded Stranded, Stalked And Finally Sated to /mnt/pvc/dataset/StrandedStalkedAndFinallySated
2023/01/19 01:45:49 Downloaded Mail Order Bride: Hannah's Dilemma to /mnt/pvc/dataset/MailOrderBrideHannahsDilemma
2023/01/19 01:45:49 Downloaded Perpetual Love to /mnt/pvc/dataset/PerpetualLove
2023/01/19 01:45:49 Downloaded Western Romance: Cowboy Romance: Love of A Good Cowboy (Western Historical Short Story Romance) to /mnt/pvc/dataset/WesternRomanceCowboyRomanceLoveofAGoodCowboyWesternHistoricalShortStoryRomance
2023/01/19 01:45:50 Downloaded Once Upon The Prairie (The Brides Of Courage, Kansas, Book 1) to /mnt/pvc/dataset/OnceUponThePrairieTheBridesOfCourageKansasBook1
2023/01/19 01:45:50 Downloaded Alma's Mail Order Husband (Texas Brides Book 1) to /mnt/pvc/dataset/AlmasMailOrderHusbandTexasBridesBook1
2023/01/19 01:45:50 Downloaded 3 Book Romance Bundle: "Loving His Cowgirl" & "Love, Forgiveness & Horseshoes" & "Loving the Escort" to /mnt/pvc/dataset/3BookRomanceBundleLovingHisCowgirlLoveForgivenessHorseshoesLovingtheEscort
2023/01/19 01:45:50 Downloaded A Cowboy's Love to /mnt/pvc/dataset/ACowboysLove
2023/01/19 01:45:50 Downloaded Contemporary Cowboy Romance 3 Book Box Set to /mnt/pvc/dataset/ContemporaryCowboyRomance3BookBoxSet
2023/01/19 01:45:50 Downloaded Cry of the West: Hallie (Finding Home Series #1) to /mnt/pvc/dataset/CryoftheWestHallieFindingHomeSeries1
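
The output paths in the log above look like each book title with everything except ASCII letters and digits removed. As a rough illustration of that mapping (an assumption inferred from the log, not the code in main.go), a hypothetical `titleToPath` helper could be sketched as:

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// nonAlnum matches every run of characters other than ASCII letters and digits.
var nonAlnum = regexp.MustCompile(`[^A-Za-z0-9]+`)

// titleToPath is a hypothetical helper (not from the original main.go).
// It strips spaces and punctuation from title and joins the result to dataDir,
// mirroring the title-to-path pattern seen in the log output.
func titleToPath(dataDir, title string) string {
	return filepath.Join(dataDir, nonAlnum.ReplaceAllString(title, ""))
}

func main() {
	// Titles taken from the log above.
	fmt.Println(titleToPath("/mnt/pvc/dataset", "Stage West - David"))
	fmt.Println(titleToPath("/mnt/pvc/dataset", "Maddy's Oasis"))
}
```

A mapping like this means two titles that differ only in punctuation would land on the same path, which is worth keeping in mind when counting downloaded files.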

parallelo commented 1 year ago

I found a workaround that downloads the dataset as expected into a K8s PVC.

The workaround was to build a new Docker image myself and then use it in finetune-download-dataset.yaml.

$ cat Dockerfile
FROM golang:latest
WORKDIR /app
COPY . .
WORKDIR /app/cmd/smashwords-downloader
RUN go build -o main main.go
ENTRYPOINT ["./main"]

That being said, it still isn't clear why the original image fails. I'll probably just use my workaround for the time being. Closing this issue.

wbrown commented 1 year ago

@parallelo Thank you for reporting this! We'll figure this out and update the Docker image and/or documentation accordingly.

wbrown commented 1 year ago

Completed.