add INSSEQ info for inserted_sequence

soymintc commented 1 year ago

I am thinking of running something downstream that requires inserted sequences to infer amino acid sequence changes. I just added the INSSEQ ad hoc but please feel free to modify the PR to your liking.

soymintc commented 1 year ago

@helrick Gotcha interesting I didn't notice this before. Anyways got it fixed

soymintc commented 1 year ago

Added a Dockerfile to avoid switching to python 3.9 environments every time

waltergallegog commented 11 months ago

Hi @helrick, I'm also interested in this feature. Have you already implemented it in the current version of SAVANA?

If not, do you think the solution proposed in this pull request is still valid for the current version of SAVANA? (moving it to line 220) https://github.com/cortes-ciriano-lab/savana/blob/9b65935e1486034fc6aea9f72c2365d134b5a509/savana/core.py#L220 Thanks

soymintc commented 11 months ago

@helrick @waltergallegog Sorry for my delayed review on this PR; I have been quite content with the previous SAVANA version that I didn't care to pull the recent main branch. I've now resolved the conflict requested; actually it's just about adding some lines for INSSEQ so wasn't too difficult.

soymintc commented 11 months ago

@helrick I guess printing the inserted_sequence could be an option since it will increase the line length significantly. But since otherwise it's impossible to figure out what the inserted sequence is, I would vote that this indeed is a feature worth adding :)

waltergallegog commented 11 months ago

Hi @soymintc I just tested and the proposed change alone does not work with the new version. It fails in the classification step, as the new SVINSSEQ field is not present in the data used for training (I think):

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- SVINSSEQ

I have added a new modification on file classify.py, function get_info_fields(vcf) to ignore the SVINSSEQ field when performing inference:

def get_info_fields(vcf):
    """ initialize the data header from INFO fields of sample VCF """
    fields = []
    for rec in vcf.header_iter():
        if rec['HeaderType'] == "INFO":
            field = rec.info()['ID']
            if field != 'SVINSSEQ':
                fields.append(field)

It seems to work, I will push the modifications to my fork of SAVANA when I finish testing.

helrick commented 11 months ago

Hi both, thanks for the comments + PR. I agree that having the inserted sequence is a valuable feature, thanks @waltergallegog for the updates to @soymintc's proposed change. I should be able to test shortly.

However, I do agree with @soymintc that it may increase the size of output VCFs as well as decrease readability, especially when the inserted sequences are very long. One potential solution to this might be to have a "inserted sequence id" in the VCF file (i.e. ins_seq_01) if the length is over a certain threshold. Then a FASTA file with the inserted sequences could be included in the output directory containing for example:

> ins_seq_01
ATTCATCCATGGATAGAC...

Would the feature implemented in this way work for your use cases?

waltergallegog commented 11 months ago

Hi @helrick thanks for the quick answer. Regarding your proposed solution for me it would work without problems.

soymintc commented 11 months ago

@helrick Agreed! Actually that'd the way to go; this branch just started for me as an ad hoc workaround, so if that's implemented, I figure this branch has no other utility. @waltergallegog If you have urgent needs you could use the recent update I commited in the same PR, but if not I'd wait for Hillary's update :)

waltergallegog commented 11 months ago

Hi @helrick, regarding the inserted sequence and length, do you have any plans to improve the selection of it?

For example in the insertion below, there are 29 supporting reads, with insertion length ranging from 166 to 343. (The variance in length is caused repetitive AAA..AAAA of different lengths in each read)

Savana currently chooses as representative the read with the longest insertion, of length 343. From a quick inspection this reads seems more of an outlier to me, as most of the reads show insertion size in the range 240 - 300.

Maybe outputing as inserted sequence a consensus, or a more representative read could be better? (Let me know if I should move this discussion to a new issue).

cortes-ciriano-lab / savana

add INSSEQ info for inserted_sequence #18