wakatara opened this issue 8 months ago
hi @wakatara - thank you for making this issue and trying out this feature!
If you have the opportunity, an MRE that doesn't involve your domain-specific code would be helpful. If not, I can work on reproducing / debugging this when time allows.
Hey @zzstoatzz, lemme try and put something together. The issue is we're calling a lot of API endpoints we built (which in turn wrap NASA APIs), so I will try to come up with an MRE.
Also, please let me know how I can help diagnose this better if you don't have the time. I noticed someone in the #prefect-community Slack had used an in-code memory profiler to demo a memory leak in another area, and I imagine that's what this is (it's far too consistent across worker counts, different directories, and times of day and week).
(And once again, thanks for your help on getting it this far. I'm actually super excited to just puzzle out the crash, because this thing is basically done once that is sorted...)
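For anyone wanting to try the same in-code profiling approach, here is a minimal sketch using Python's standard-library `tracemalloc` (the profiler used in the Slack thread isn't named here, so this is just one reasonable option):

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

# ... run a batch of flow submissions here, then inspect the top allocators ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # file:line, cumulative size, and allocation count
```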
@zzstoatzz I guess the other question is: is there another way to achieve this same result besides the web runners? (Which I have to admit I like, because they are easily handled via docker-compose and work intuitively, at least imho.)
Effectively, I want to scale up a single pipeline, which I'm not sure can be handled by `.map()` in an async fashion due to the need to pass multiple prior values, and I'm unsure of the mechanism map would use to keep those all straight from the subflow. (So this may just be my unfamiliarity with Prefect conventions, but does map keep those things in order if based on an initial filename, or the filenames passed? See the sketch after the snippet below.)
So, in the below example, the task photometry_fits_submit is not async (though it would probably be easy to make it so), i.e.:
```python
photom_job_id = task.photometry_fits_submit(scratch, identity, photometry_type)
# the job processes here and can take anywhere between 15s and 90s;
# it is checked with lots of retries and retry backoff jitter
photometry = task.photometry_fits(photom_job_id)
```
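To the ordering question above: in Prefect 2, `Task.map` returns futures in the same order as the mapped input, and `unmapped()` lets you pass the extra shared parameters. A minimal sketch, assuming the task signature from the snippet above (`submit_all` is a hypothetical wrapper flow):

```python
from prefect import flow, task, unmapped

@task
def photometry_fits_submit(scratch: str, identity: str, photometry_type: str) -> str:
    # placeholder body: submit one file for fitting and return a job id
    return f"job-{scratch}"

@flow
def submit_all(files: list[str], identity: str, photometry_type: str) -> list[str]:
    # .map fans out over `files` while unmapped() pins the shared arguments;
    # the returned futures come back in the same order as `files`, so each
    # result stays matched to its input filename.
    futures = photometry_fits_submit.map(files, unmapped(identity), unmapped(photometry_type))
    return [f.result() for f in futures]
```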
I'm also thinking of our out-of-band chat about the async example you showed.
Set the flow parameter `cache_result_in_memory` to `False`. This will fix 90% of the issue, although I do still think there is a memory leak.
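Concretely, that suggestion looks something like the following sketch (assuming Prefect 2.x; `sci_backend_flow` and `fits_file` are placeholder names, and `persist_result=True` is added so the result stays retrievable once it is no longer held in memory):

```python
from prefect import flow

# With cache_result_in_memory=False, the flow run does not keep its
# (potentially large) result object alive in the runner process;
# persist_result=True writes it to result storage instead.
@flow(cache_result_in_memory=False, persist_result=True)
def sci_backend_flow(fits_file: str):
    ...  # photometry tasks go here
```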
@bellmatthewf Oh wow, I just saw this message of yours. I had to fall back to Airflow since even with 2.19.0 I did not have this working, but I will try again when I get some time to see if I can get it working with that setting. And yes, there is definitely a memory leak, but I need to spend some time figuring out how to profile it. It seems the web runner is now a lower-priority infra approach for Prefect (though I have to say, when it works, it works fantastically... 🥰).
I'm still committed to simplifying our ingest stack though, and I really do like the way Prefect works much better.
I opened a possibly related issue (#12668), including a minimal example to reproduce. That may help you!
@bellmatthewf OMG, this looks great. Thank you. Watching both issues now.
First check
Bug summary
I have a flow that calls a subflow using the new `submit_to_runner` feature. It works great until it reaches a certain point, and then I get continual crashes at around the 1100-1350 files processed mark (regardless of which directories or files are processed). The strange thing is that it is just the Prefect server that crashes; the individual tasks running on the workers (50 at the same time) work fine. So I'm assuming it's a memory leak, since I think I've ruled out almost everything else (the issue happens both in local dev and on my beefy server, running Prefect via docker compose).
Due to the way the subflow needs to work, it has to run synchronously, and I don't believe I can use the async `.map` mechanic because multiple parameters must be added to each call. Basically, it collects up a list of files and then sends them to the web runner subflow, and the subflow calls all the tasks (roughly as in the sketch below). As mentioned, this works great and processes the science pipeline fine, but then the server process loses its mind at around 1100-1350 files every time (and, well, I have ~8M files to process. 8-/).
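A rough shape of that setup, as a sketch rather than the exact code (Prefect 2.x, where `submit_to_runner` is experimental; `ingest` and `fits_file` are placeholder names, and the runner must be serving `sci_backend` with its webserver endpoints enabled):

```python
from prefect import flow
from prefect.runner import submit_to_runner  # experimental in Prefect 2.x

@flow
def sci_backend(fits_file: str):
    # calls photometry_fits_submit, photometry_fits, etc., synchronously
    ...

@flow
def ingest(files: list[str]):
    for fits_file in files:
        # Each call POSTs one subflow run to the runner's webserver;
        # requires PREFECT_EXPERIMENTAL_ENABLE_EXTRA_RUNNER_ENDPOINTS=True.
        submit_to_runner(sci_backend, parameters={"fits_file": fits_file})
```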
As mentioned, other than this everything else is working as it should, and the `sci_backend` processes work fine synchronously (the preceding tasks are dependent on previous results in a stepwise fashion, i.e. get_jpl_orbit, get_description, etc.). The processes do have to wait significant amounts of time for results to come back in some cases (one of the APIs is throttled to allow only one call at a time from each IP), but as mentioned, this works fantastically and robustly until it gets to about the 1100-file mark. Then the server process crashes, and while the results are already in the database, they're not complete. (The data is separated into directories by comet/small solar system body, and even when I try to only analyze different directories I get the same result.)
I suspect a memory leak in the web runner feature (which is awesome, btw ❤️), but having refactored, error-corrected, and even swapped between SQLite and Postgres to rule out various things about my setup, I'm a bit stumped as to how to diagnose the problem or what it could be. No logs get printed except for post-crash issues, so even a way to make the logs vastly more verbose would help (see the sketch below), though as I said, I suspect a memory leak. I'm also open to other methods of making this work besides the web runner feature.
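On the verbosity point: Prefect's log level is controlled by the `PREFECT_LOGGING_LEVEL` setting. For a dockerized server the usual route is exporting `PREFECT_LOGGING_LEVEL=DEBUG` in the container environment; a sketch of the in-code equivalent for a local run (assuming Prefect 2.x):

```python
from prefect.settings import PREFECT_LOGGING_LEVEL, temporary_settings

# Raises Prefect's log verbosity to DEBUG for the wrapped code only;
# equivalent to exporting PREFECT_LOGGING_LEVEL=DEBUG for the process.
with temporary_settings({PREFECT_LOGGING_LEVEL: "DEBUG"}):
    ...  # kick off the flow run here
```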
Reproduction
Error
Versions
Additional context
This is a port of an existing flow from Airflow, where it does work in DAG form, so I suspect it's a Prefect issue.
Huge props to @zzstoatzz on the #prefect-community Slack who provided some guidance on this issue and suggested filing a GH ticket.