almeidasilvaf / syntenet

An R package to infer and analyze synteny networks from protein sequences
https://almeidasilvaf.github.io/syntenet/

Can't seem to pass check_input #14

Closed by iaindhay 1 year ago

iaindhay commented 1 year ago

I have been trying to get syntenet working, but I can't seem to get past the check_input - check_list_names stage. My understanding is that it compares the header/name of the proteins in the FASTA/AAStringSet object with the "gene_id" in the GRanges object (from column 9 of the GFF file). But I can't get it to pass even when they are identical. The files I am working with are user-generated (i.e., not from a database), with the FASTA header set to the protein ID; after failing to get it to pass, we set every attribute in GFF column 9 to the same protein ID. I have tried the "gene_field" option in check_input, setting it to each of the other columns in the GRanges object, but it doesn't seem to help. I have also tried removing the ".1" from all the names in both the FASTA and GFF files, and that doesn't change anything. Any help appreciated.

I'm using the current version on Bioconductor (3.17).

> aastringsetlist
$processed_NC_004578_cds_protein_sequences
AAStringSet object of length 27:
     width seq                                                                                  names               
 [1]   493 MNREEFLRLAADGYNRIPLARETLADFDTPLSIYLKLADQP...VQAGGGIVADSVPALEWEETLNKRRAMFRAVALASQTAEG WP_007244435.1
 [2]   640 MTKTSRCWPFAACLLSLACGTATAGPYSTMVVFGDSLADAG...ATFGVSQKLTQDLTLRGNYNWRKNDDVTQQGVNVALSMSF WP_011103197.1
 [3]    80 MNEILDQLRKEFATPCPSLSAVRERYFSHLSNDRNLLRKINAGRIALKVSRTGGTRQGHPFVYLHDLANYLSDIVTNRAA     WP_046463953.1
 [4]    82 MLDQPPGNSQHDAYLALAQRIQDAIASDKAQIEHQVLLIREPGESAAHWDHIVDQISEAEGIVVTRSPENGTAHVSWYIDSL   WP_005768037.1
 [5]   203 METQEISQEEMAERMGVTPGAVGHWLNGKREPKIEVINRLL...KLVEDAGRYFLKPLNPAYPTLPVTEDCKLIGVIRQMTMRL WP_005768035.1
 ...   ... ...
[23]   336 MASTASYEHVLHRYPSVQEWLELLGNLGRAPATLEAYGRGL...RDPKTTLIYVHLSGADLTARLAHSVGSLDARLFAELFKSE WP_011103210.1
[24]   713 MSYVPFDVDHYERQEELSDLERTILSNRRYRSDWAYLQSSV...DERAIVEGDLAKLDGLIRKLDDVPTLDGRTPSQIEAKKNR WP_011103211.1
[25]   218 MITPSRYPGIYIAPLSNEPTAAHTFKEQAEEALDHISAAPS...EELRAVGLDKYRYSLTKKPSENSIRAEHGLPLRMKYRAHQ WP_011103212.1
[26]   241 MPDPAQFSDGRWKKLPTQLSSITLARFDQDICTNNHGISQR...DPNLGEFHTHSKALADTIENISSADGLPLIGVQVFASKIH WP_226992679.1
[27]   199 MLLMIDNYDSFTYNVVQYLGELGADVKVIRNDELTIAQIEA...LRHKTLNVEGVQFHPESILTEQGHELFANFLKQSGGHRQG WP_005767895.1

$processed_NC_005773_cds_protein_sequences
AAStringSet object of length 21:
     width seq                                                                                  names               
 [1]   493 MNREEFLRLAAEGYNRIPLARETLADFDTPLSIYLKLADQP...VQAGGGIVADSVPVLEWEETLNKRRAMFRAVALASQPAEG WP_011167608.1
 [2]   640 MTKTSRRWPFAACLLSLACGTAAAAPYSTMVVFGDSLADAG...ATLGVSQKLTQDLTLRGNYNWRKNDDVTQQGVNVALSMSF WP_011167609.1
 [3]    80 MNEILDQLRKEFATPCPSLSAVRERYFSHLSNDRNLLRKINAGRIDLKVSRTGGSRQGHPFVYLHDLAKYLSAIVTNRAA     WP_004643929.1
 [4]    82 MLKQFPDTSQHDAYLALAQRIQDAITGDKAQIEHQVLLIREPGESVAHWERIMDQISEAEGISVTRNPENGTARVSWYIDSL   WP_004642420.1
 [5]   203 METQQISQEEMAERMGVTPGAVGHWLNGKREPKIEVINRFL...KLVEDAGRYFLKPLNPAYPTLAVTEECKLIGVIRQMTMRL WP_004663805.1
 ...   ... ...
[17]   391 MALTDLKIRQAKPGKTSSKLTDSGGLYLEVTTGGSKLWRYR...QLAHVEQKKSKAAYNHASYLPARKALMQWWDNYIFGSESD WP_011167620.1
[18]   100 MSVVKLFDTATSPVDVFDEVVNSGIAGVYGGRLAVREVASQ...SSGLTHNHSRTELAYMFDDRKFSVDQALDAVEKTYDITFP WP_041924422.1
[19]    66 MSDHDIDILIKLPEVCRQAGFGKSTIYELIAAGTFPAPTKLGRFSRWSQKEVQDWIELQKLARFAA                   WP_011167621.1
[20]   210 MELKRDPMLEQVVLPLLGNGDMIMSWQGVRADESINRRYLP...GIQYDLMIATDATACSSAYGLCDSGADGFNDTNVQLGEAA WP_081002664.1
[21]   199 MLLMIDNYDSFTYNVVQYLGELGADVKVIRNDELTIEQIEA...LRHKTLNVEGVQFHPESILTEQGHELFANFLKQSGGHRQG WP_002551940.1

> grangeslist
GRangesList object of length 2:
$processed_NC_004578_cds_features
GRanges object with 27 ranges and 8 metadata columns:
          seqnames      ranges strand |   source     type     score     phase             ID           Name
             <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>    <character>    <character>
   [1] NC_004578.1      1-1482      + |       NA     gene        NA         0 WP_007244435.1 WP_007244435.1
   [2] NC_004578.1   1704-3626      - |       NA     gene        NA         0 WP_011103197.1 WP_011103197.1
   [3] NC_004578.1   3733-3975      - |       NA     gene        NA         0 WP_046463953.1 WP_046463953.1
   [4] NC_004578.1   4097-4345      - |       NA     gene        NA         0 WP_005768037.1 WP_005768037.1
   [5] NC_004578.1   4498-5109      - |       NA     gene        NA         0 WP_005768035.1 WP_005768035.1
   ...         ...         ...    ... .      ...      ...       ...       ...            ...            ...
  [23] NC_004578.1 20673-21683      + |       NA     gene        NA         0 WP_011103210.1 WP_011103210.1
  [24] NC_004578.1 21707-23848      + |       NA     gene        NA         0 WP_011103211.1 WP_011103211.1
  [25] NC_004578.1 23950-24606      - |       NA     gene        NA         0 WP_011103212.1 WP_011103212.1
  [26] NC_004578.1 25258-25983      - |       NA     gene        NA         0 WP_226992679.1 WP_226992679.1
  [27] NC_004578.1 27745-28344      + |       NA     gene        NA         0 WP_005767895.1 WP_005767895.1
                names        gene_id
          <character>    <character>
   [1] WP_007244435.1 WP_007244435.1
   [2] WP_011103197.1 WP_011103197.1
   [3] WP_046463953.1 WP_046463953.1
   [4] WP_005768037.1 WP_005768037.1
   [5] WP_005768035.1 WP_005768035.1
   ...            ...            ...
  [23] WP_011103210.1 WP_011103210.1
  [24] WP_011103211.1 WP_011103211.1
  [25] WP_011103212.1 WP_011103212.1
  [26] WP_226992679.1 WP_226992679.1
  [27] WP_005767895.1 WP_005767895.1
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

$processed_NC_005773_cds_features
GRanges object with 21 ranges and 8 metadata columns:
          seqnames      ranges strand |   source     type     score     phase             ID           Name
             <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>    <character>    <character>
   [1] NC_005773.3      1-1482      + |       NA     gene        NA         0 WP_011167608.1 WP_011167608.1
   [2] NC_005773.3   1823-3745      - |       NA     gene        NA         0 WP_011167609.1 WP_011167609.1
   [3] NC_005773.3   3851-4093      - |       NA     gene        NA         0 WP_004643929.1 WP_004643929.1
   [4] NC_005773.3   4215-4463      - |       NA     gene        NA         0 WP_004642420.1 WP_004642420.1
   [5] NC_005773.3   4618-5229      - |       NA     gene        NA         0 WP_004663805.1 WP_004663805.1
   ...         ...         ...    ... .      ...      ...       ...       ...            ...            ...
  [17] NC_005773.3 16070-17245      + |       NA     gene        NA         0 WP_011167620.1 WP_011167620.1
  [18] NC_005773.3 17245-17547      + |       NA     gene        NA         0 WP_041924422.1 WP_041924422.1
  [19] NC_005773.3 17609-17809      - |       NA     gene        NA         0 WP_011167621.1 WP_011167621.1
  [20] NC_005773.3 17806-18438      - |       NA     gene        NA         0 WP_081002664.1 WP_081002664.1
  [21] NC_005773.3 21636-22235      + |       NA     gene        NA         0 WP_002551940.1 WP_002551940.1
                names        gene_id
          <character>    <character>
   [1] WP_011167608.1 WP_011167608.1
   [2] WP_011167609.1 WP_011167609.1
   [3] WP_004643929.1 WP_004643929.1
   [4] WP_004642420.1 WP_004642420.1
   [5] WP_004663805.1 WP_004663805.1
   ...            ...            ...
  [17] WP_011167620.1 WP_011167620.1
  [18] WP_041924422.1 WP_041924422.1
  [19] WP_011167621.1 WP_011167621.1
  [20] WP_081002664.1 WP_081002664.1
  [21] WP_002551940.1 WP_002551940.1
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

> check_input(aastringsetlist, grangeslist)
Error in check_list_names(seq, annotation) : 
  Names of list elements in 'seq' and 'annotation' must match.

almeidasilvaf commented 1 year ago

Hi, @iaindhay

Thank you for using syntenet.

I think you misunderstood what the internal function check_list_names() does. It checks whether the two lists (seq and annotation) have the same element names; it does not look at the protein or gene IDs inside them. By visual inspection, I can see that the list names are different. You can confirm that by executing:

> names(aastringsetlist)
[1] "processed_NC_004578_cds_protein_sequences" "processed_NC_005773_cds_protein_sequences"

> names(grangeslist)
[1] "processed_NC_004578_cds_features" "processed_NC_005773_cds_features"

I would suggest renaming them to keep only the 'NC_...' part, or you could give them a better (human-readable) name.
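The renaming suggested above could be sketched as follows. This is a minimal example in base R, not code from syntenet: `extract_accession()` is a hypothetical helper, and the two character vectors stand in for `names(aastringsetlist)` and `names(grangeslist)` as printed earlier in this thread.

```r
# Stand-ins for names(aastringsetlist) and names(grangeslist) from this issue
seq_names <- c("processed_NC_004578_cds_protein_sequences",
               "processed_NC_005773_cds_protein_sequences")
annot_names <- c("processed_NC_004578_cds_features",
                 "processed_NC_005773_cds_features")

# Hypothetical helper (not part of syntenet): keep only the "NC_<digits>" part
extract_accession <- function(x) {
    sub(".*(NC_[0-9]+).*", "\\1", x)
}

new_seq_names <- extract_accession(seq_names)
new_annot_names <- extract_accession(annot_names)

# TRUE when both lists would end up with the same element names
names_match <- identical(new_seq_names, new_annot_names)

# With the real objects, the same idea would be (assumed usage):
# names(aastringsetlist) <- extract_accession(names(aastringsetlist))
# names(grangeslist) <- extract_accession(names(grangeslist))
# check_input(aastringsetlist, grangeslist)
```

Any shared naming scheme should work just as well; the only requirement stated by the error message is that the element names of the two lists match.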

What you are describing in the issue (checking if sequence names in seq match gene IDs in annotation) is another step of the quality control, which is performed by the internal function check_gene_names(). Maybe you got confused there.
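For anyone who wants to inspect that second kind of mismatch by hand, here is a rough base R sketch. It is not the implementation of check_gene_names(); the ID vectors are small stand-ins copied from the output above, and the commented lines showing how to pull IDs from the real objects are an assumption.

```r
# Stand-ins: per-genome sequence names and annotation gene IDs
seq_ids <- list(
    NC_004578 = c("WP_007244435.1", "WP_011103197.1"),
    NC_005773 = c("WP_011167608.1", "WP_011167609.1")
)
annot_ids <- list(
    NC_004578 = c("WP_007244435.1", "WP_011103197.1"),
    NC_005773 = c("WP_011167608.1", "WP_011167609.1")
)

# For each genome, list sequence names that have no matching gene ID
missing_ids <- lapply(names(seq_ids), function(sp) {
    setdiff(seq_ids[[sp]], annot_ids[[sp]])
})
names(missing_ids) <- names(seq_ids)

# TRUE when every sequence name is found in the annotation
all_match <- all(lengths(missing_ids) == 0)

# With the real objects, the IDs could be extracted like this (assumed usage):
# seq_ids <- lapply(aastringsetlist, names)
# annot_ids <- lapply(grangeslist, function(x) x$gene_id)
```

A non-empty element in `missing_ids` would point to the genome and the specific IDs that need fixing.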

If you find that helpful, feel free to close the issue.

Best, Fabricio

iaindhay commented 1 year ago

Thank you for the help, and sorry for the misunderstanding. Renaming the raw files appropriately does indeed seem to have fixed the issue.