axflow / axflow

The TypeScript framework for AI development
https://axflow.dev
MIT License

Support additional formats #55

Open chrisbraddock opened 10 months ago

chrisbraddock commented 10 months ago

I have some unstructured text (paragraphs) as well as some JSON I could flatten to unstructured.

My other data sources are YouTube auto subtitles and "random" site scrapes.

It'd be amazing not to have to string all that gather/process work together myself.

benjreinhart commented 10 months ago

Hey Chris, can you elaborate a little more on "random site scrapes"?

We already support plain text. You can pass text directly to, e.g., the splitter objects, or use the TextDocument class when using the RAG pipeline.

Let me take a look at JSON and we can schedule the others in too.

chrisbraddock commented 10 months ago

Oh that's great thanks -- I swore I read something about only supporting markdown at the moment.

As far as "random site scrapes" I just meant there are several sites with data I'll eventually want to pull in.

I've seen a few UIs that'll actually facilitate this, but it's pretty low on my list. I can manage that and massage the data.

With the JSON as well, not difficult to flatten, I just didn't realize you already had the text import support.

Definitely going to play this weekend, thank you!

benjreinhart commented 10 months ago

If you have a specific example of a JSON blob and the before/after that you would expect, that could help us make sure we're providing a solution that works for you.

chrisbraddock commented 10 months ago

I can look later but I'm pretty sure it's not standard format or schema.

Might be cool to allow a transformer function in the UI for JSON (or others even). That way everything is in one place, no misc. scripts hanging around on the file system.

The flattening code I have is probably 15 lines.
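For anyone following along, a flattener of roughly that size might look like the sketch below. This is not the code referred to above and not part of Axgen; `flattenJson` and the `Json` type are hypothetical names, and the "path: value" line format is just one assumed way to turn JSON into text a splitter can consume.

```typescript
// Hypothetical helper, not an Axgen API: flattens arbitrary JSON into "path: value" lines.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function flattenJson(value: Json, path = "", lines: string[] = []): string[] {
  if (value === null || typeof value !== "object") {
    // Leaf value: emit one line, labelled with its path when there is one.
    lines.push(path ? `${path}: ${String(value)}` : String(value));
  } else if (Array.isArray(value)) {
    // Arrays: index each element, e.g. tags[0], tags[1].
    value.forEach((item, i) => flattenJson(item, `${path}[${i}]`, lines));
  } else {
    // Objects: join nested keys with dots, e.g. author.name.
    for (const key of Object.keys(value)) {
      flattenJson(value[key], path ? `${path}.${key}` : key, lines);
    }
  }
  return lines;
}

// Usage: the joined string is plain text that can be handed to a text splitter.
const flattenedText = flattenJson({
  title: "Intro",
  tags: ["a", "b"],
  author: { name: "Chris" },
}).join("\n");
```

Keeping the key path on every line means each chunk still carries some context after splitting, at the cost of longer text.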

benjreinhart commented 10 months ago

We definitely want to support some functionality in the UI, like the ones you mention (chunking, splitting, some data loading). However, for that to work well, we will have to extract some of that logic out of Axgen because Axgen is currently a server-side library as it deals with things like sensitive API keys / cloud credentials.

Good news! This work is already planned. We are going to be extracting some of the functionality out into packages that can be consumed independently of one another (or together to get the biggest bang for the buck).

Chigala commented 9 months ago

@benjreinhart does Axflow have a roadmap out there? I'd love to contribute!

benjreinhart commented 9 months ago

@Chigala currently we have not made a public one. We've been discussing internally what we want it to be for the next 2 or so months. Once we solidify that, I'd happily make a public version for those interested in contributing!

wadewegner commented 6 months ago

I would echo the request for a few other splitters.

I see support for Markdown, Text, and CSV. I'd like to see JSON (example for me would be to use it with GeoJSON). I don't know much about splitters in this context, but what strikes me as potentially interesting would be to break up the JSON based on attributes in GeoJSON, e.g., state, country, lat/long, hextiles, etc.
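To make the attribute-based idea concrete, here is a rough sketch of grouping GeoJSON features by one property and emitting one text chunk per group. It is an illustration only, not an existing Axgen splitter: `splitByAttribute`, the trimmed-down `Feature` and `FeatureCollection` types, the `"(missing)"` bucket, and the sample data are all assumptions.

```typescript
// Minimal GeoJSON shapes; only the fields this sketch touches.
interface Feature {
  type: "Feature";
  geometry: { type: string; coordinates: unknown } | null;
  properties: Record<string, unknown> | null;
}

interface FeatureCollection {
  type: "FeatureCollection";
  features: Feature[];
}

// Hypothetical helper, not an Axgen API: returns one text chunk per distinct value of `attribute`.
function splitByAttribute(collection: FeatureCollection, attribute: string): Map<string, string> {
  const groups = new Map<string, string[]>();
  for (const feature of collection.features) {
    const raw = feature.properties?.[attribute];
    // Features without the attribute land in a shared "(missing)" bucket.
    const key = raw === undefined || raw === null ? "(missing)" : String(raw);
    const line = JSON.stringify({ geometry: feature.geometry, properties: feature.properties });
    const bucket = groups.get(key) ?? [];
    bucket.push(line);
    groups.set(key, bucket);
  }
  // Join each group's features into a single newline-separated chunk.
  const chunks = new Map<string, string>();
  groups.forEach((lines, key) => chunks.set(key, lines.join("\n")));
  return chunks;
}

// Usage with made-up sample data: two chunks, keyed "WA" and "OR".
const sample: FeatureCollection = {
  type: "FeatureCollection",
  features: [
    { type: "Feature", geometry: { type: "Point", coordinates: [-122.33, 47.61] }, properties: { state: "WA", name: "Seattle" } },
    { type: "Feature", geometry: { type: "Point", coordinates: [-117.43, 47.66] }, properties: { state: "WA", name: "Spokane" } },
    { type: "Feature", geometry: { type: "Point", coordinates: [-122.68, 45.52] }, properties: { state: "OR", name: "Portland" } },
  ],
};
const chunksByState = splitByAttribute(sample, "state");
```

Grouping by a discrete property such as state or country is straightforward; lat/long or hex-tile grouping would first need a function that maps coordinates to a bucket key, which this sketch leaves out.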

I think PDF support would be nice, along with URLs, but I understand if you would prefer the dev to pull the text and just use the Text splitter.

Thanks! Neat project.