Closed kbergin closed 4 years ago
➤ Saman Ehsan commented:
I attached what the updated file_format_map.json would look like to fix these issues. Specifically, I added “npz”, “npy” and “csv.gz” file extensions and generalized the pattern for matching zarr files. This should resolve all of the files that have “unknown” file types at the moment.
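To illustrate the kind of change described in this comment, here is a minimal sketch. It assumes that the keys of the map are regular expressions searched against file names; that is an assumption, and the specific patterns, labels, and file names below are hypothetical rather than copied from the attachment.

```python
import re

# Hypothetical entries for the extensions named in the comment.
# Keys are treated as regular expressions (an assumption); the real
# file_format_map.json may use different patterns or labels.
UPDATED_ENTRIES = {
    r"[.]npz$": "npz",
    r"[.]npy$": "npy",
    r"[.]csv[.]gz$": "csv.gz",
    # Generalized zarr pattern: match ".zarr" anywhere in the name rather
    # than only at the end, so members of a zarr store also match.
    r"[.]zarr": "zarr",
}


def match_format(file_name, entries=UPDATED_ENTRIES):
    """Return the label of the first pattern found in file_name, else None."""
    for pattern, label in entries.items():
        if re.search(pattern, file_name):
            return label
    return None


# Made-up file names, for illustration only.
for name in ["cell_ids.npy", "matrix.npz", "metrics.csv.gz", "sample.zarr/.zattrs"]:
    print(name, "->", match_format(name))
```

Anchoring the first three patterns with `$` keeps them from matching in the middle of a name, while the zarr pattern is deliberately left unanchored.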
➤ Saman Ehsan commented:
For reference, here are the logs from an SS2 workflow, where you can see that most of the zarr files do not match any pattern in the file_format_map.json: https://00e9e64bac8618b42c41bcb6af9a180ebe18189dbbce03ae39-apidata.googleusercontent.com/download/storage/v1/b/hca-dcp-pipelines-prod-cromwell-execution/o/caas-cromwell-executions%2FAdapterSmartSeq2SingleCell%2Fd8642609-29f2-47fc-b1a5-125776ac10ee%2Fcall-submit%2Fsubmit_wdl.submit%2Fbeeaf3f8-072e-4402-8d4d-26cd72ac7a59%2Fcall-create_submission%2Fstdout?qk=AD5uMEtO2WkdowJ32XWP5T7-mqK4a-9V1sHCBYOHUbz7oAB4gwQXYctzGBGGcug7zHIjcDdK6TxanAMpnN33Y_z-NiQ_wLmjjcKqJ_wVgSJG9loomipbshVg9PK152J3ccJwjMST_wjMRXmQtTTe_n6hrQpBgn0dV_ODbF-PmocMX8Zt_gNocrzAunqohg4xhE2axHFfuScf1qRDUL1qICtfxbiutdkwv_dI8oeDQk6S10Kb6NWbDo3txg_eHRTgceeY_6pRJh1NLrYjtDoEL_aDT3bZ7gwYV5g-dVJa_RoYqFbJH7RLRoAGRCp-na6PYax7AmSh0pxJKCzfQ0MRNYmx0TwH7mkNpvksqVaw-UrB1KbTJuvEmLvPIKwfrRmGlkZMyWg2OXdyuxvD7ZKEoQFgj8DsR1HN6pPMNpX9L2jrvFCZu8uECGq-1bdDnrw758pSvk08i0v6OwDMHXna1EZE5vGmBwSwcvLXgoyLnk0xuGsDboBKmodVGyzdV1bwqJkK8OBV10CS3cYMeu_BnvMM5nhYqH0lIX6uyqcp5ChsMWLqzHH9IUEuZhSglMk04Udv-w8OVILm-6-deqXdETpgXqiehAZdF00M9SD2cjjfvLqvT7UcKYhZeRkYuQcSrUHEU8wAyAAtAKBDOtwXlla1wr5zUoSF-ISWukZZDfZHSpbCBy1Sb_ItcJihJaz2qq1GTmko9fo7FbInQ8d7PW7ogykV9NQRmpMd9PvKBbvZBNNHK4e57mNwwPnq2xesVIgX3gu2k2tvTov2fUvrZFj_QnI1qTH-sJDYAPKXywiIU1OOwMzPUAS4iZd4TdKVDFgz0ubZ_qYbPOcxdA6fENUj3dViL5WBrMUPV3stpCz_c38fbonDMj8CmaA2ki8zG7qlF9Aq08p80lPm0lFdoWl2y5qLTtVY9PEsirw2vjnQXE9m6SLUefEUSR2kEiNyqUrRZ_9IEjhmEjUapBMOp46Y_7xo11XGLlqFskTiJglPQNmbJGt3m2EeI77IiaErM_ckLEIX3ER02KVE2TDqZJEpXJttoBIT0SA4Xn3XJ6KCzlVIvgq1CLheJFFxazfGv8HP1Spsn9dS
➤ Saman Ehsan commented:
And here is an example from an Optimus workflow: https://00e9e64bac89f559c5c708d6fb73d93840e96c75ef65725485-apidata.googleusercontent.com/download/storage/v1/b/hca-dcp-pipelines-prod-cromwell-execution/o/caas-cromwell-executions%2FAdapterOptimus%2F16e08724-c927-4124-96da-89101d16efa7%2Fcall-submit%2Fsubmit_wdl.submit%2F44aaa242-f157-4c09-888b-d3c386dd6ebf%2Fcall-create_submission%2Fstdout?qk=AD5uMEvpJxYkoV4Lcr8KShEg-_RlxUtvZ8bn6TBBcJXdlhlQx7JCmMiIm8mpsw3qoVUL7nyLhs8kA86Oa16WygIRliEeCp5DVs26_7BOghRsZCNxsNi1FLr07mBdFbFXNr0vL5zaLM-QFbXxeK5n231cDcNhZfeiW4XaqDzkr7nIA6K7db69QT98cDTOC48tG1izwuTNbA3jWQz8O5s6o5mEBvfrUgrdv68L9m5LGgwTT12FOaB4BWwa2W-LE8i6gAasl_QvgUiuNVNiy0ISOEhrujq5p4cBKTo8lA2KuVTodtiZzvrXejPD2Pn8HTOJA-KrDZpjsOIhfTTszmd5wXoCiGfLi3Bs3yGNdGRvDS7Ilj5b8ioZ7950mzAalmFm_qtKEjMgx7awBOHOgOWbcvbUJaFSv_Dkq_NWEz4kWDpRGaYWB80QVZOaL7UQR4Z_yKNa9-rL-GU7l1w1z9O1My2MfDEZzbSw-trOkEOW9ziu2gegXs5oLot7hA3UAENe2kuNnyxJsZQ7Zys40HRLK-UFqHhTTa7YvlaEO97hv1uNjCIH66PbPS_X9-3sc7gwNOWlFiQ0nVsm54J5Hm9I5oarJ-8D-qEQ4Y2nO9YCEOv838H9LEjbpqjj14faF4Mf8NDbVgtccH23a91NoY3Prkxp6xcRAppOK2HQ16_DD51zblYLwSfFmtIrEReHQE0L99gKC_lnQYeuiJhGdgkuVsvFLmgNNEhYzCJEOdb-2H8TfoFPTkilniJuCsj_9zi-1HYVHxZG_WDW0MSCv6zMOEdiF_BOYkhBkW0POtoeGAXB5htXF8EECzGUGEWX2TQX_6-Xw3Lcb4sEepNAA-Yel_6Mc8w69UMB2JO78sTIA_gbWmx8ZRdlBh-L8NEL8hNFLw3QWLFP0MRbdrYdKKbMLI26Ni4d4_8KA23l_8dHkXeye0G0YJgY7pkqhSHrA9f9lA6xy5P9griTC6Caf0Z8XXprxgDHoTOo2BekJPNWzi8Hw8HmWj0k4AKBDptOqUsuYk275n_ERn0y0IIsJVzhQtei0iQfL1RtTi2baprho9ycnlN_MEXZPjw
➤ Chengchen Wang commented:
Unfortunately, both of the links are broken.
➤ Saman Ehsan commented:
Here is the PR! https://github.com/HumanCellAtlas/pipeline-tools/pull/164
➤ Saman Ehsan commented:
This is ready to go, but waiting on a fix from Azul for how they check the number of analysis files in a secondary analysis bundle.
➤ Saman Ehsan commented:
The fix from Azul is in their dev environment, and they will let us know when it gets promoted up to production.
➤ Saman Ehsan commented:
The issue has been fixed by Azul!
➤ Saman Ehsan commented:
For QA, check an SS2 and an Optimus submission envelope and ensure that no files have an “unknown” file type. Note: the file metadata is paginated, and there are links at the end of each page for navigating forward/back to view the rest of the results.
E.g.
SS2:
Cromwell workflow: “cede9808-60db-4fa2-bfc8-6aed38e802d9”
File metadata in submission envelope:
https://api.ingest.integration.data.humancellatlas.org/submissionEnvelopes/5d9e24c223c41d0008bad6ca/files
Optimus:
Cromwell workflow: “e1dc764e-e365-4259-a65c-0285317ef9f6”
File metadata in submission envelope:
https://api.ingest.integration.data.humancellatlas.org/submissionEnvelopes/5d9dd6b923c41d0008bad3dd/files
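Because the file metadata is paginated, the QA check could be scripted rather than clicked through. The following is only a sketch: it assumes each page is a JSON document with file records under `_embedded.files`, the next page under `_links.next.href`, and the file type under `content.file_core.format`. None of these field names are confirmed by this thread, so adjust them to match the actual responses.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_files(first_url, fetch=fetch_json):
    """Yield file records from every page, following the "next" link."""
    url = first_url
    while url:
        page = fetch(url)
        # Assumed layout: records under _embedded.files, next page under
        # _links.next.href. Adjust if the API responds differently.
        yield from page.get("_embedded", {}).get("files", [])
        url = page.get("_links", {}).get("next", {}).get("href")


def get_format(record):
    """Assumed location of the file type inside a record."""
    return record.get("content", {}).get("file_core", {}).get("format")


def unknown_files(first_url, fetch=fetch_json, get_format=get_format):
    """Collect the records whose file type is reported as "unknown"."""
    return [r for r in iter_files(first_url, fetch) if get_format(r) == "unknown"]
```

Calling `unknown_files(...)` with one of the submission envelope URLs above should return an empty list if the fix is working, provided the assumed field names match. The `fetch` parameter makes it possible to try the pagination logic against canned pages without network access.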
➤ Chengchen Wang commented:
QAed!
➤ Saman Ehsan commented:
Thanks!
When submitting analysis file metadata, we refer to a mapping of file extensions to file types, and any file whose extension is not in the map is marked as "unknown". Update the mapping to include any missing file extensions.
https://github.com/HumanCellAtlas/pipeline-tools/blob/master/adapter_pipelines/format_map_example.json
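To find which extensions are missing, one option is to run the lookup over a list of workflow output names and collect whatever falls through to "unknown". The sketch below is not the pipeline-tools implementation; it assumes the map keys are regular expressions, and the map contents and file names are hypothetical.

```python
import re

UNKNOWN = "unknown"


def get_file_format(path, format_map):
    """Return the type of the first matching pattern, or "unknown"."""
    for pattern, file_type in format_map.items():
        if re.search(pattern, path):
            return file_type
    return UNKNOWN


def find_unmapped(paths, format_map):
    """List the paths that would be submitted with an "unknown" type."""
    return [p for p in paths if get_file_format(p, format_map) == UNKNOWN]


# Hypothetical map and output names, for illustration only.
format_map = {r"[.]bam$": "bam", r"[.]csv$": "csv"}
outputs = ["aligned.bam", "qc.csv", "qc.csv.gz", "counts.npz"]
print(find_unmapped(outputs, format_map))  # ['qc.csv.gz', 'counts.npz']
```

Each name in the result points at a pattern that would need to be added to the map.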
┆Issue is synchronized with this Jira Story ┆Attachments: file_format_map.json