Retrieve Full-Texts from Sinequa Dev Servers

Description

The existing URL import code that brings URLs into COSMOS cannot support our downstream ML tasks, because those tasks need full texts. AFAIK, full texts cannot be retrieved via the query endpoint, only via the SQL endpoint.

Work was started previously on:

https://github.com/NASA-IMPACT/COSMOS/issues/1016

However, this card was too broad, and we are breaking it into smaller chunks.

Implementation Considerations

use the engine.sql endpoint to get all existing metadata from the dev servers
use the engine.sql endpoint to get full_texts from the dev servers
store the incoming full_text in a new CandidateURL field called full_text
how will we do error handling?
Tests: To really test the important bits, we would need to emulate a Sinequa server, which we are not going to do. Therefore, it is probably not worth writing any tests right now.
We should be using tokens, similar to config_generation/minimum_api.py. The actual code will reference an environment variable. For local development, the token goes in the .django local file; on the server, it will be set when we deploy.
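The considerations above (SQL endpoint, token from an environment variable, error handling) can be sketched roughly as follows. This is a minimal sketch, not the actual COSMOS code. The environment variable name `SINEQUA_TOKEN`, the `/api/v1/engine.sql` path, the `{"sql": ...}` payload, the bearer-token header, and the `Rows` response key are all assumptions that would need to be checked against the Sinequa dev servers and config_generation/minimum_api.py.

```python
import json
import os
import urllib.error
import urllib.request

# Hypothetical name; the real variable would be defined in the .django local file.
TOKEN_ENV_VAR = "SINEQUA_TOKEN"


class SinequaSqlError(RuntimeError):
    """Raised when the SQL endpoint call fails or returns an unexpected shape."""


def build_sql_request(base_url, sql, token):
    """Build a POST request for the SQL endpoint (path and payload are assumed)."""
    payload = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/engine.sql",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def parse_rows(body, columns):
    """Turn a JSON response body into a list of dicts keyed by column name."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise SinequaSqlError("response was not valid JSON") from exc
    # Assumed response shape: {"Rows": [[col1, col2, ...], ...]}
    rows = data.get("Rows") if isinstance(data, dict) else None
    if not isinstance(rows, list):
        raise SinequaSqlError("response did not contain a 'Rows' list")
    return [dict(zip(columns, row)) for row in rows]


def fetch_rows(base_url, sql, columns, timeout=60):
    """Read the token from the environment, call the endpoint, parse the rows."""
    token = os.environ.get(TOKEN_ENV_VAR)
    if not token:
        raise SinequaSqlError(f"{TOKEN_ENV_VAR} is not set")
    request = build_sql_request(base_url, sql, token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_rows(response.read(), columns)
    except urllib.error.URLError as exc:  # also covers HTTPError
        raise SinequaSqlError(f"request to {request.full_url} failed: {exc}") from exc
```

Keeping request building and response parsing separate from the network call means those two pieces can be checked locally without emulating a Sinequa server. One possible answer to the error-handling question is the single exception type used here, which the import script could catch and log per source.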

Open Questions

Credentials: for local development, we will use Li's server
Once it goes into staging, should it use the existing environment variable?

Deliverable

dropdown menu
updated import script
new field
data migration
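For the updated import script and new field, the core step is matching incoming rows to existing candidate URLs and filling in full_text. Below is a minimal in-memory sketch of that step. The `CandidateURL` dataclass is only a stand-in for the real Django model, and the helper name `apply_full_texts` and the row keys `url` and `full_text` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateURL:
    """Stand-in for the Django model; only the fields this sketch needs."""

    url: str
    full_text: str = ""  # the new field proposed in this issue


def apply_full_texts(candidates, rows):
    """Copy full_text from each row onto the candidate with the same url.

    Returns (number of candidates updated, list of row urls with no match).
    """
    by_url = {candidate.url: candidate for candidate in candidates}
    updated = 0
    unmatched = []
    for row in rows:
        candidate = by_url.get(row.get("url"))
        if candidate is None:
            unmatched.append(row.get("url"))
            continue
        # Treat a missing or null full_text as an empty string.
        candidate.full_text = row.get("full_text") or ""
        updated += 1
    return updated, unmatched
```

In the real script the same loop would presumably operate on model instances and save them in bulk. Returning the unmatched URLs makes it easy to log rows that exist on the dev server but not in COSMOS.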

### Tasks
- [ ] https://github.com/NASA-IMPACT/COSMOS/issues/1075

NASA-IMPACT / COSMOS

Retrieve Full-Texts from Sinequa Dev Servers #1071
