atlas-engineer / nfiles

User configuration and data file management
BSD 3-Clause "New" or "Revised" License

Enable buffered reading #1

Open · aartaka opened this issue 2 years ago

aartaka commented 2 years ago

Async reading is saving us in many places, but it does not save us from reading too much. For example, once my Nyxt history file exceeds 150 MB (which it does every one to two months), it makes my Lisp image segfault, for there simply is not enough memory to process/allocate that much text.

This may be better as a feature request against cl-prevalence or SBCL, but it also concerns how nfiles works: should we somehow enable batch/buffered file reading, so that we don't have to allocate too much when reading from files?
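A rough sketch of the buffered reading proposed here, assuming a hypothetical `process-chunk` callback that is not part of nfiles; the point is only that a fixed-size buffer bounds the allocation per read:

```lisp
;; Sketch only: read a file in fixed-size chunks instead of slurping
;; it whole.  PROCESS-CHUNK is a hypothetical callback, not an nfiles
;; API.
(defun read-buffered (path process-chunk &key (buffer-size 65536))
  (with-open-file (stream path :element-type 'character)
    (loop with buffer = (make-string buffer-size)
          for end = (read-sequence buffer stream)
          while (plusp end)
          do (funcall process-chunk (subseq buffer 0 end)))))
```

Note that this only helps if the consumer can work incrementally; s-serialization::deserialize-sexp consumes a whole stream and builds the full object graph regardless, so chunking at the nfiles level alone would not fix the history case.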

Ambrevar commented 2 years ago

In the case of the Nyxt history, the entire file must be processed at once because it's a single s-exp, right? Can you think of another way around it?

How does it crash Nyxt / SBCL? Is it a limitation of SBCL that it cannot read too-big s-expressions? Is there a way to recover from it?

What do you mean by "batch reading"?

Also, you mentioned "async reading"; is this related to asynchronicity?

aartaka commented 2 years ago

> In the case of the Nyxt history, the entire file must be processed at once because it's a single s-exp, right?

Even a single s-exp can be broken into smaller pieces with a smart enough parser, I believe.

> Can you think of another way around it?

Allocating more memory :D

> How does it crash Nyxt / SBCL? Is it a limitation of SBCL that it cannot read too-big s-expressions? Is there a way to recover from it?

It does not print any error message; it simply segfaults, and it is not recoverable. The frequency of the segfaults correlates with the size of the history file. The problem most probably comes from allocating too much memory when parsing the object representation in cl-prevalence: s-serialization::deserialize-sexp is a recursive function with no tail-call optimization, after all. Parsing a huge object is likely to clog the memory there :(
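A toy illustration of this failure mode (not cl-prevalence code): a recursive walk whose control-stack depth tracks the nesting of its input, just as a recursive deserializer's does.

```lisp
;; Toy example, not cl-prevalence code.  Stack depth grows with input
;; nesting; deep enough input exhausts the control stack.  SBCL
;; usually reports "Control stack exhausted", but under heavy memory
;; pressure the process can die less gracefully.
(defun count-conses (form)
  (if (consp form)
      (+ 1
         (count-conses (car form))   ; non-tail call: frame stays live
         (count-conses (cdr form)))
      0))

;; A deeply nested input blows the default stack:
;; (let ((x nil))
;;   (dotimes (i 1000000) (setf x (list x)))
;;   (count-conses x))
```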

> What do you mean by "batch reading"?

Oh wait, it seems I misused the word. What I meant is reading a big piece of text in small, digestible pieces so as not to clog the memory too much.

> Also, you mentioned "async reading"; is this related to asynchronicity?

Yes, it is. I meant the asynchronous reading/writing to files that nfiles enables.

Ambrevar commented 2 years ago

> Allocating more memory :D

How do we do this?

> s-serialization::deserialize-sexp is a recursive function with no tail-call optimization, after all. Parsing a huge object is likely to clog the memory there :(

If the call stack is the problem, then I'm not too sure how to fix it, since tail-call optimization cannot be done here.

Maybe use static-dispatch and inline the methods, if static-dispatch allows it?

The first thing we'd need is data to reproduce the crash. Do you think you can generate a history with random entries that reproduces it?

If you already have a history file that triggers the issue, you could try replacing all the titles / URLs with dummy content to anonymize it.

By the way, this is the kind of operation we are likely to ask of our users. Maybe write a Lisp/Nyxt script to automate it?
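Such a script could take roughly the following shape. A hedged sketch, not an existing Nyxt command: it assumes the history file can be read back as plain s-expressions (real cl-prevalence output may need a matching package), and it blanks every string, which covers titles and URLs:

```lisp
;; Hypothetical anonymizer sketch.  Reads top-level forms and replaces
;; every string with x's of the same length, preserving structure and
;; approximate file size.
(defun anonymize (form)
  (typecase form
    (string (make-string (length form) :initial-element #\x))
    (cons (cons (anonymize (car form)) (anonymize (cdr form))))
    (t form)))

(defun anonymize-file (in out)
  (with-open-file (input in)
    (with-open-file (output out :direction :output :if-exists :supersede)
      (let ((*read-eval* nil))        ; never evaluate #. in data files
        (loop for form = (read input nil input)  ; stream as EOF marker
              until (eq form input)
              do (print (anonymize form) output))))))
```

The irony is that running this over a 150 MB single-form history hits the same memory wall as the deserializer, so it may itself need a larger --dynamic-space-size to finish.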

> Yes, it is. I meant the asynchronous reading/writing to files that nfiles enables.

I understand the problem with memory, but I don't understand how it's related to asynchronicity. A few commits ago on master, reading was synchronous, and I believe the same memory issue exists there.

Can you clarify?

aartaka commented 2 years ago

> > Allocating more memory :D
>
> How do we do this?

It was just a joke about adding more --dynamic-space-size :)
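For reference, the flag in question is passed to SBCL at startup; the size is in megabytes by default, and recent SBCL versions also accept unit suffixes such as 2GB:

```sh
# Start SBCL with a larger heap; size is in megabytes by default.
sbcl --dynamic-space-size 4096
```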

> > s-serialization::deserialize-sexp is a recursive function with no tail-call optimization, after all. Parsing a huge object is likely to clog the memory there :(
>
> If the call stack is the problem, then I'm not too sure how to fix it, since tail-call optimization cannot be done here. Maybe use static-dispatch and inline the methods, if static-dispatch allows it?

The best way would be to rewrite deserialize-sexp as an iterative or properly tail-call-optimized version. What's static-dispatch?

> The first thing we'd need is data to reproduce the crash. Do you think you can generate a history with random entries that reproduces it?
>
> If you already have a history file that triggers the issue, you could try replacing all the titles / URLs with dummy content to anonymize it.
>
> By the way, this is the kind of operation we are likely to ask of our users. Maybe write a Lisp/Nyxt script to automate it?

I already deleted it :( But yes, a script can be useful.

> > Yes, it is. I meant the asynchronous reading/writing to files that nfiles enables.
>
> I understand the problem with memory, but I don't understand how it's related to asynchronicity. A few commits ago on master, reading was synchronous, and I believe the same memory issue exists there. Can you clarify?

It's unrelated, sorry.

EDIT: complete answer.

Ambrevar commented 2 years ago

I'm not sure how to write a tail-call-optimized version of deserialize-sexp. Could you provide a draft?

Otherwise an iterative version would work.

Static dispatch is this: https://github.com/alex-gutev/static-dispatch
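Not a real patch to s-serialization, but a minimal sketch of the transformation being asked for, applied to the toy walk from earlier in the thread rather than to deserialize-sexp itself: the control stack becomes an explicit, heap-allocated work list, so input depth can no longer overflow the stack.

```lisp
;; Iterative counterpart of COUNT-CONSES: the recursion's call stack
;; is replaced by an explicit list of pending nodes on the heap.
(defun count-conses* (form)
  (let ((count 0)
        (stack (list form)))
    (loop while stack
          do (let ((node (pop stack)))
               (when (consp node)
                 (incf count)
                 (push (car node) stack)
                 (push (cdr node) stack))))
    count))
```

The real deserializer is harder: each recursive call builds and returns a partial object, so the work list would also have to carry construction state (a second stack of partially built objects), but the shape of the change is the same.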