TheDataShed / xlsxreader

A low-memory high performance library for reading data from an xlsx file
https://godoc.org/github.com/TheDataShed/xlsxreader
MIT License
74 stars 21 forks source link

Invalid Path for File Name #22

Closed siddhant2408 closed 4 years ago

siddhant2408 commented 4 years ago

Error:

copy excel to csv: open file: File not found: xl//xl/worksheets/sheet1.xml

We're using the relationship file to make the name of the spreadsheet file. Can we not just get the names of the spreadsheets that are present in the root directory of the xml

For the above file, the file listing is:

     docProps/app.xml
     docProps/core.xml
     xl/theme/theme1.xml
     xl/worksheets/sheet1.xml
     xl/styles.xml
     _rels/.rels
     xl/workbook.xml
     xl/_rels/workbook.xml.rels
     [Content_Types].xml

As we can see, xl/worksheets/sheet1.xml is already present in the listing

AndyThurgood commented 4 years ago

The paths you provided as an example do not match, I'm also not quite sure on what you are driving at? Can you provide some replication steps/ improved context please?

siddhant2408 commented 4 years ago

Sample File test3.xlsx

The library is unable to open this file and the error is

copy excel to csv: open file: File not found: xl//xl/worksheets/sheet1.xml

Can you please take a look

siddhant2408 commented 4 years ago

I think there is an issue here https://github.com/TheDataShed/xlsxreader/blob/f13424b56c2d13078ef168ff38fbc0ff53d2fa60/sheets.go#L33

return "xl/" + rel.Target, nil

Maybe rel.Target is already giving us the the file name with xl/ appended. WDYT

siddhant2408 commented 4 years ago

Any ideas?

dglsparsons commented 4 years ago

Could you add a sample file that replicates the issue? I'm not seeing it when playing around, but it could be something specific to the file you have?

dglsparsons commented 4 years ago

From your original post:

We're using the relationship file to make the name of the spreadsheet file. Can we not just get the names of the spreadsheets that are present in the root directory of the xml

That's because listing the files that are the contents kind of breaks the concept of treating an xlsx file as a 'black box'. The relationships file is used to translate the user facing sheet name, into the physical file name on disk. i.e. from image these.

Hope that makes sense?

siddhant2408 commented 4 years ago

I added the sample file in my comment above https://github.com/TheDataShed/xlsxreader/issues/22#issuecomment-684306714

dglsparsons commented 4 years ago

I can't get the same issue with that file you've attached, so I'm not sure how you're causing it?

I'm running the following code:

    e, err := xlsxreader.OpenFile("./test3.xlsx")
    if err != nil {
        fmt.Printf("error: %s \n", err)
        return
    }
    defer e.Close()

    fmt.Printf("Worksheets: %s \n", e.Sheets)

    for row := range e.ReadRows("Aziende") {
        if row.Error != nil {
            fmt.Printf("error on row %d: %s \n", row.Index, row.Error)
            return
        }
    }

Produces the following output:

$ go run main.go
Worksheets: [Aziende] 
error on row 1: Unable to get cell value for cell A1 - no value element found 
dglsparsons commented 4 years ago

I'm also not sure how you've produced that spreadsheet? Because the contents appear to be completely invalid. i.e.

    <row r="1">
      <c r="A1" s="1" t="abcdefGhi">
        <is>
          <t>JKL Mnopq</t>
        </is>
      </c>
      <c r="B1" s="1" t="rstuvwXyz">
        <is>
          <t>Abcdef ghijklm</t>
        </is>
      </c>

The t field on a c element (type of a cell) appears to be complete garbage.

siddhant2408 commented 4 years ago

Yes, we removed the unncessary things from the file. But it got corrupted in the process. Can you please try this file?

test2_onlyheader.xlsx

dglsparsons commented 4 years ago

So again, following code works:

func main() {
    e, err := xlsxreader.OpenFile("./test2_onlyheader.xlsx")
    if err != nil {
        fmt.Printf("error: %s \n", err)
        return
    }
    defer e.Close()

    fmt.Printf("Worksheets: %s \n", e.Sheets)

    for row := range e.ReadRows("Aziende") {
        if row.Error != nil {
            fmt.Printf("error on row %d: %s \n", row.Index, row.Error)
            return
        }

        if row.Index < 10 {
            fmt.Printf("%+v \n", row.Cells)
        }
    }
}

output:

Worksheets: [Aziende] 
[{Column:A Row:1 Value:URL Atoka} {Column:B Row:1 Value:Codice fiscale} {Column:C Row:1 Value:Partita IVA} {Column:D Row:1 Value:Numero REA} {Column:E Row:1 Value:Provincia REA} {Column:F Row:1 Value:Ragione sociale} {Column:G Row:1 Value:Natura giuridica} {Column:H Row:1 Value:Sul mercato da} {Column:I Row:1 Value:Startup innovativa} {Column:J Row:1 Value:Operativa} {Column:K Row:1 Value:Codice ATECO 2007} {Column:L Row:1 Value:Descrizione ATECO 2007} {Column:M Row:1 Value:Codice NACE} {Column:N Row:1 Value:Descrizione NACE} {Column:O Row:1 Value:Indirizzo completo (sede legale)} {Column:P Row:1 Value:C.A.P. (sede legale)} {Column:Q Row:1 Value:Comune (sede legale)} {Column:R Row:1 Value:Provincia (sede legale)} {Column:S Row:1 Value:Regione (sede legale)} {Column:T Row:1 Value:Area geografica (sede legale)} {Column:U Row:1 Value:Numero unità locali (oltre alla sede legale)} {Column:V Row:1 Value:Altre unità locali} {Column:W Row:1 Value:Telefoni verificati} {Column:X Row:1 Value:Telefoni} {Column:Y Row:1 Value:Email verificate (info)} {Column:Z Row:1 Value:Email (info)} {Column:AA Row:1 Value:Email verificate (vendite)} {Column:AB Row:1 Value:Email (vendite)} {Column:AC Row:1 Value:Email verificate (marketing)} {Column:AD Row:1 Value:Email (marketing)} {Column:AE Row:1 Value:Email verificate (acquisti)} {Column:AF Row:1 Value:Email (acquisti)} {Column:AG Row:1 Value:Email verificate (direzione)} {Column:AH Row:1 Value:Email (direzione)} {Column:AI Row:1 Value:Email verificate (assistenza)} {Column:AJ Row:1 Value:Email (assistenza)} {Column:AK Row:1 Value:Email verificate (tecnico)} {Column:AL Row:1 Value:Email (tecnico)} {Column:AM Row:1 Value:Email verificate (amministrazione)} {Column:AN Row:1 Value:Email (amministrazione)} {Column:AO Row:1 Value:Email verificate (altre)} {Column:AP Row:1 Value:Email (altre)} {Column:AQ Row:1 Value:Website} {Column:AR Row:1 Value:Numero dipendenti} {Column:AS Row:1 Value:Ricavi (€)} {Column:AT Row:1 Value:Data chiusura bilancio} {Column:AU Row:1 Value:Quotata} {Column:AV Row:1 Value:Dettagli persone} {Column:AW Row:1 Value:Primo esponente} {Column:AX Row:1 Value:Ruolo primo esponente} {Column:AY Row:1 Value:Secondo esponente} {Column:AZ Row:1 Value:Ruolo secondo esponente} {Column:BA Row:1 Value:Terzo esponente} {Column:BB Row:1 Value:Ruolo terzo esponente} {Column:BC Row:1 Value:Faxes} {Column:BD Row:1 Value:Pec}] 
siddhant2408 commented 4 years ago

Are you sure you tested the exact file without opening and saving the file again. That fixes the file

emilien-puget commented 4 years ago

Hello,

I made this test on a fork here https://github.com/emilien-puget/xlsxreader/blob/b9ae9374ebce9752f70bc248f51c5f7074951024/file_test.go#L36, and this test results in a panic

dglsparsons commented 4 years ago

Considering opening and saving the file fixes the issue, and i don't think having xl/ as a prefix in the relationships is covered in the xlsx specification -> https://www.ecma-international.org/publications/standards/Ecma-376.htm

It looks like the xlsx file is simply corrupt? How has this file been created?

dglsparsons commented 4 years ago

Maybe the other solution would just be that the path provided in the relationships file could be ambiguous, either as a partial (starting after xl/), or starting with xl/, and the library should simply decide whether to add the prefix or not. It's currently unclear to me if it's just that one file that isn't formed properly or a wider problem though.

emilien-puget commented 4 years ago

This isn't an issue just with that one file, it's happening at scale (i have no idea how these xlsx are created), your idea about checking whether the file already contains the start of the path may be a solution

dglsparsons commented 4 years ago

Okay, i'll look at preparing a fix tomorrow.

dglsparsons commented 4 years ago

This should be resolved in the latest release now (#23 closes). Let us know if the issue persists :).

dglsparsons commented 4 years ago

@AndyThurgood can you close this issue?

AndyThurgood commented 4 years ago

Closing...